By Gildas Ehrmann https://github.com/gildas84/
The ongoing digital transformation is enabling a new range of opportunities to experiment with new tools and approaches to policy making. Such tools promise to equip policy makers with granular, timely and accurate information on existing Science, Technology and Innovation (STI) ecosystems and trends. Alongside the growing volume of data on STI policies, the means to explore, analyse and visualise policy data are becoming affordable and accessible to all. As data grow in volume and complexity, new techniques for analysing large bodies of qualitative data have emerged, such as natural language processing (NLP). NLP automatically reviews and analyses textual data to derive meaning from qualitative material in a quantifiable way. It involves building a large corpus (a group of documents) of interest, which computerised semantic tools then read automatically to generate insights, identifying key words and how these stand out in the broader context. This analysis explores how such an approach can be used to build evidence to inform decision-making.
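To illustrate how key words can be made to "stand out in the broader context", consider a minimal sketch of one common weighting scheme, TF-IDF, hand-rolled on a hypothetical toy corpus (the documents and terms below are illustrative only, not STIP data):

```python
import math

# Toy corpus: three tokenised policy descriptions (hypothetical examples)
docs = [
    "innovation grant firm".split(),
    "innovation tax credit firm".split(),
    "research grant university".split(),
]

def tf_idf(term, doc, docs):
    """Term frequency x inverse document frequency: high for words that
    are frequent in one document but rare across the corpus."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)
    return tf * math.log(len(docs) / df)

# "university" appears in only one document, so it stands out there,
# while "grant" is spread across the corpus and receives a lower weight.
```

The analysis below relies on a more elaborate pipeline (dictionaries, lemmatisation, topic modelling), but the underlying intuition of weighting terms by their salience is the same.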
The 2021 edition of the STIP Compass database provides a wealth of information on more than 6500 policy initiatives for 57 countries and the European Union. However, the portal’s current interactive dashboards do not make it easy to summarise and benchmark the types of policies countries have introduced. As a result, users of the portal can be overwhelmed by the amount of information. Policy makers and policy analysts have an appetite for country overviews in specific policy areas, leveraging that rich dataset, and for the ability to compare countries against one another. To address this limitation, this analysis explores how unsupervised machine learning can be used to synthesise text information and identify key patterns in the data.
The objectives of this pilot are fourfold. Using the data reported in the surveys feeding the 2017, 2019 and 2021 STIP Compass datasets, this pilot aims to:
This analysis takes the form of a Python notebook that includes the code snippets used to identify the main topics hidden in policy initiatives' short descriptions and objectives. It visualises how these topics have been emphasised by countries across two editions of the STIP Compass dataset (2017 and 2021). This analysis is conducted in the context of the broader work of the Working Party on Innovation and Technology Policy (TIP) on new policy tools in support of STI policy. To feed such work, it focuses on the “Innovation in firms and innovative entrepreneurship” policy area of the STIP Compass database. This analysis could be replicated and extended to other policy areas covered by the database (e.g. governance, public research system).
This notebook is structured as follows. Section 2 describes how the STIP Compass dataset is downloaded and made ready for analysis. Section 3 details the pre-processing necessary ahead of computerised semantic analysis. Section 4 provides summary statistics of the policy data reported in the “Innovation in firms and innovative entrepreneurship” policy area. Section 5 provides the backbone for the computerised semantic analysis, using Latent Dirichlet Allocation method to identify the main topics and how these are distributed across policy initiatives. Section 6 includes a number of visualisations to compare how countries emphasise such topics in their policies and checks whether such emphasis has varied over time. Lastly, Section 7 reflects on the outcomes and limitations of this pilot analysis, together with possible next steps.
This notebook works directly on the combined 2017, 2019 and 2021 STIP Compass databases as published by the STIP Data Lab. This first section retrieves the dataset from the web and conducts a few preliminary operations to trim and structure the Python dataframe. In particular, the relevant data is trimmed, arranged and formatted. It produces an extract of essential columns and introduces a few variables used throughout the analysis.
The three datasets (2017, 2019 and 2021) are downloaded and merged into a single dataframe (compass_df). Some corrections are made on variables including cleaning the fields "InitiativeID" or aggregating the “CountryLabel”/”CountryCode” for Belgian reporting entities (for this country, the database contains separate information for five administrative authorities). Additionally, duplicate initiatives, occurring naturally when multiple “InstrumentID” are linked to a single “InitiativeID”, are removed. A field (“INN”) is created to identify and filter initiatives relevant to the “Innovation in firms and innovative entrepreneurship” policy area. In the STIP Compass datasets, initiatives under this policy area are linked to one or more of the themes below. Additional information on how the STIP Compass datasets are structured is available in the STIP Data Lab page.
List of policy themes in the “Innovation in firms and innovative entrepreneurship” policy area of the STIP Compass dataset:
With the scope of initiatives defined, a few additional data transformations are performed. The six objectives fields and the short description field are concatenated into one field to build the raw corpus for NLP. The raw corpus is kept as it is, except for three Colombian initiatives that are not translated into English and are therefore filtered out, and three Mexican initiatives whose 2017 names must be corrected. Finally, an extra field is added to store countries' OECD membership status (OECD countries and non-OECD countries).
This notebook focuses its comparative analysis on seven OECD countries (Australia, Canada, France, Germany, Italy, Korea and the United Kingdom) and one non-OECD country (Brazil). These countries are selected as the quality of their data is relatively higher compared to that of others in the database and/or they are engaged in ongoing projects led by the Committee for Scientific and Technological Policy (CSTP). The countries to include in the analysis are stored in a dataframe (my_countries). Country selection can be changed at will from the list of 38 OECD countries and 20+ non-OECD countries and territories participating in the STIP Compass database, although the degree and quality of reporting varies across countries.
The preliminary steps described above are conducted in the code cell below (click to expand/collapse).
# 2.1 Import necessary libraries:
# Generic libraries
import numpy as np
import pandas as pd
from pprint import pprint
#import os
#import plotly.graph_objects as go
#from plotly.subplots import make_subplots
#import tqdm
# Pre-processing libraries
import spacy # spacy for lemmatization
import re # to handle strings
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
# Enable logging for gensim - optional
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)
# Plotting tools
import pyLDAvis
#import pyLDAvis.gensim # don't skip this - oldversion
import pyLDAvis.gensim_models # don't skip this
import matplotlib.pyplot as plt
%matplotlib inline
# Misc
#import warnings
#warnings.filterwarnings("ignore",category=DeprecationWarning)
# 2.2 Import the datasets
compass_df2021 = pd.read_csv(r'C:\Users\Gildas\OneDrive\OECD\Datasets\STIP_Survey-2021 - Copy3.csv', sep='|', encoding='UTF-8-SIG', header=0, low_memory=False)
compass_df2021.drop(index=compass_df2021.index[0],
axis=0,
inplace=True)
compass_df2019 = pd.read_csv(r'C:\Users\Gildas\OneDrive\OECD\Datasets\STIP_Survey-2019.csv', sep='|', encoding='UTF-8-SIG', header=0, low_memory=False)
compass_df2019.drop(index=compass_df2019.index[0],
axis=0,
inplace=True)
compass_df2017 = pd.read_csv(r'C:\Users\Gildas\OneDrive\OECD\Datasets\STIP_Survey-2017.csv', sep='|', encoding='UTF-8-SIG', header=0, low_memory=False)
compass_df2017.drop(index=compass_df2017.index[0],
axis=0,
inplace=True)
compass_df2017.rename(columns={'CoutryLabel': 'CountryLabel'}, inplace=True)
# There is no CountryCode in 2017 database - let's add it back based on 2021 data
list_CountryCode = compass_df2021[['CountryLabel', 'CountryCode']].drop_duplicates().reset_index(inplace = False).copy()
list_CountryCode = list_CountryCode.drop(['index'], axis=1)
compass_df2017 = compass_df2017.drop(['CountryCode'], axis=1)
compass_df2017 = pd.merge(compass_df2017, list_CountryCode, on ='CountryLabel')
# 2.3 Merge and format the dataframe
# merge the 3 surveys into a single dataset
compass_df = pd.merge(compass_df2021, compass_df2019, how='outer')
compass_df = pd.merge(compass_df, compass_df2017, how='outer')
# First sets of treatment on data:
for col in ['Objectives1', 'Objectives2', 'Objectives3', 'Objectives4',
            'Objectives5', 'Objectives6', 'ShortDescription']:
    compass_df[col] = compass_df[col].fillna('')
compass_df['Tags'] = compass_df['Tags'].fillna("¬")
compass_df['YearlyBudgetRange'] = compass_df['YearlyBudgetRange'].fillna("Don't know")
compass_df.drop(index=compass_df.index[0],
axis=0,
inplace=True)
# Shorten the initiatives ID, removing HTTP references (2017, 2019 and 2021 URLs)
compass_df['InitiativeID'] = compass_df['InitiativeID'].str.replace(r"http://stip\.oecd\.org/20(?:17|19|21)/data/policyInitiatives/", "", regex=True).str.strip()
# The case of Belgium: unify reporting entities under one country
# (longest labels first, so partial matches do not corrupt longer ones)
for label in ["Belgium - Wallonia-Brussels Federation", "Belgium-Brussels Federation",
              "Belgium - Brussels Capital", "Belgium - Federal government",
              "Belgium - Flanders", "Belgium - Wallonia"]:
    compass_df['CountryLabel'] = compass_df['CountryLabel'].str.replace(label, "Belgium").str.strip()
for code in ["BEBRU", "BEFED", "BEVLG", "BEWAL", "BEWBF"]:
    compass_df['CountryCode'] = compass_df['CountryCode'].str.replace(code, "BEL").str.strip()
# Ensure all the data is in the right format:
for x in range(0, compass_df.shape[1]):
compass_df.iloc[:,x] = compass_df.iloc[:,x].astype(int, errors='ignore')
# 2.4 Complement the dataset:
# Introduce a variable called 'INN' to filter initiatives related to the “Innovation in firms and innovative entrepreneurship” policy area:
# Field TH82 isn't populated in all surveys
theme_cols = ['TH30', 'TH31', 'TH32', 'TH33', 'TH34', 'TH35', 'TH36', 'TH37', 'TH38', 'TH82']
compass_df_INN = compass_df[theme_cols].fillna(0).copy()
compass_df_INN['INN'] = compass_df_INN[theme_cols].apply(pd.to_numeric).sum(axis=1)
compass_df['INN'] = compass_df_INN['INN'].astype(int)
for x in range(0, compass_df_INN.shape[1]):
compass_df_INN.iloc[:,x] = compass_df_INN.iloc[:,x].astype(int, errors='ignore')
# Remove initiatives outside the INN policy area (i.e. INN == 0):
compass_df = compass_df[compass_df['INN'] > 0].copy()
# Add column 'OECD_STATUS' on OECD membership
OECD_STATUS_COUNTRY = ['AUS','AUT','BEBRU','BEFED','BEVLG','BEWAL','BEWBF','BEL','CAN','CHL','COL','CRI','CZE','DNK','EST','FIN','FRA','DEU','GRC','HUN','ISL','IRL','ISR','ITA','JPN','KOR','LVA','LTU','LUX','MEX','NLD','NZL','NOR','POL','PRT','SVK','SVN','ESP','SWE','CHE','TUR','GBR','USA']
compass_df.loc[compass_df['CountryCode'].isin(OECD_STATUS_COUNTRY), 'OECD_STATUS'] = 'OECD country'
compass_df.loc[(compass_df['OECD_STATUS'] != 'OECD country'), 'OECD_STATUS'] = 'Non-OECD country'
# stop_words_acronym: list of acronyms to remove from the corpus later on
stop_words_acronym = compass_df['Acronym'].dropna().drop_duplicates().tolist()
# Filter out the columns we don't need:
compass_df_full = compass_df # save all data
compass_df = compass_df[['InitiativeID', 'SurveyYear', 'OECD_STATUS', 'NameEnglish', 'CountryLabel', 'CountryCode', 'ShortDescription', 'Objectives1', 'Objectives2', 'Objectives3', 'Objectives4', 'Objectives5', 'Objectives6', 'YearlyBudgetRange', 'INN']].copy()
compass_df['IIDYear'] = compass_df['InitiativeID'].astype(str) + "-" + compass_df['SurveyYear'].astype(str)
# Remove duplicate initiatives:
compass_df = compass_df.drop_duplicates().reset_index(inplace = False)
compass_df.drop(['index'], axis=1, inplace=True)
# Country selection: What country should the analysis be on?
# Select your countries (CountryLabel)
my_countries = ("Australia", "Brazil", "Italy", "Korea", "United Kingdom", "Germany", "France", "Canada")
my_countries = pd.DataFrame(my_countries)
my_countries = my_countries.rename(columns = {0: "CountryLabel"})
my_countries = my_countries.sort_values(["CountryLabel"]).copy()
# Filter the list of countries CountryLabel/CountryCode
all_countries = compass_df[['CountryLabel', 'CountryCode']].copy()
all_countries = all_countries[['CountryLabel', 'CountryCode']].drop_duplicates().reset_index(inplace = False)
all_countries["Aggregate"] = "No"
all_countries = all_countries[['CountryLabel', 'CountryCode', 'Aggregate']]
# Import Countrycode in my_countries - this dataframe will be called later in coordination with the main dataframe
my_countries = pd.merge(my_countries,
all_countries,
on ='CountryLabel',
how ='inner')
new_row = pd.DataFrame([['OECD Median', 'OECD', 'Yes'], ['OECD Mean', 'OECD2', 'Yes']], columns=['CountryLabel', 'CountryCode', 'Aggregate'])
my_countries = pd.concat([my_countries, new_row], ignore_index=True)
all_countries = pd.concat([all_countries, new_row], ignore_index=True)
# concatenate all objectives into one string
compass_df['Objectives'] = compass_df['ShortDescription'] + " " + compass_df['Objectives1'] + " " + compass_df['Objectives2'] + " " + compass_df['Objectives3'] + " " + compass_df['Objectives4'] + " " + compass_df['Objectives5'] + " " + compass_df['Objectives6']
# Specific case handling: correct the names of three Mexican initiatives reported in 2017
compass_df.loc[(compass_df['InitiativeID'] == 2483) & (compass_df['SurveyYear'] == 2017) & (compass_df['CountryCode'] == 'MEX'), 'NameEnglish'] = 'INNOVATEC'
compass_df.loc[(compass_df['InitiativeID'] == 2484) & (compass_df['SurveyYear'] == 2017) & (compass_df['CountryCode'] == 'MEX'), 'NameEnglish'] = 'PROINNOVA'
compass_df.loc[(compass_df['InitiativeID'] == 2486) & (compass_df['SurveyYear'] == 2017) & (compass_df['CountryCode'] == 'MEX'), 'NameEnglish'] = 'INNOVAPYME'
The pre-processing stage prepares the corpus for the semantic analysis. First the data is formatted to simplify the raw corpus, then two dictionaries are applied to map equivalent expressions to a single preferred term, and finally the corpus is filtered to exclude certain words or types of words. The result of the pre-processing is a corpus matrix.
A first set of pre-processing tasks is performed to streamline the corpus without altering its core. The data is transformed to lowercase, stripped of non-alphanumeric characters, then stripped of non-standard characters. Words of three letters or fewer, numbers and extra whitespace are removed. Additionally, some common American English spellings are changed into their British English equivalents.
Two dictionaries of vocabulary are used to replace synonyms or equivalent expressions. This is important to facilitate the identification of keywords. For example, firm, enterprise and company are interchangeable words and are more easily captured as keywords if only the term firm is used. The first dictionary is the OECD STIP Data Lab’s vocabulary, which references common STI terminologies, acronyms and expressions, along with favoured expressions. The second dictionary is more generic and references common words as well as common bigrams and trigrams observed in this corpus. Both dictionaries are applied in the same way to substitute multiple terms and concepts with unique terms.
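The substitution mechanism can be sketched as follows, with a hypothetical two-entry dictionary mirroring the firm/enterprise/company example (the real dictionaries are loaded from the vocabulary files in the code below):

```python
import pandas as pd

# Hypothetical mini-dictionary: equivalent expressions map to one preferred term
synonym_dict = {"enterprise": "firm", "company": "firm"}

objectives = pd.Series(["support company growth",
                        "loans for enterprise digitalisation"])
# With regex=True, dictionary keys are replaced wherever they occur in the text
objectives = objectives.replace(synonym_dict, regex=True)
```

With this substitution applied, the three interchangeable terms all count towards the frequency of "firm", making the concept easier to detect as a keyword.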
The first two steps have formatted the corpus and harmonised equivalent terms and expressions. The final pre-processing step filters the corpus by the nature of its words.
Once the corpus is lemmatised, each initiative is described by a list of words, which can then feed the computerised semantic analysis.
# 3.1 Vocabulary of profiles' dictionary
# 1) import the file
vop = pd.read_excel(r'C:\Users\Gildas\OneDrive\OECD\Datasets\STIP vocabulary.xlsx', sheet_name='Existing topics', header=0, usecols=[0,1,2])
vop = vop.dropna()
vop['English alternate labels'] = vop['English alternate labels'].str.split('\n')
vop = vop.explode("English alternate labels", ignore_index=False)
# 2) Add ' ' before and after each string in vop to ensure acronyms are only replaced as whole words. Otherwise "NN" would be replaced inside "Innovation", yielding "Ineural networkvation"
vop['English alternate labels'] = ' ' + vop['English alternate labels'] + ' '
vop['English prefered label'] = ' ' + vop['English prefered label'] + ' '
# make sure the English prefered label uses British English instead of American English (list obtained from http://www.tysto.com/uk-us-spelling-list.html)
uk_spelling_subs = [('ization', 'isation'), ('izing', 'ising'), ('izement', 'isement'),
                    ('izer', 'iser'), ('izable', 'isable'), ('yzi', 'ysi'),
                    ('start ups', 'start_ups'), ('programme', 'program'),
                    ('yze', 'yse'), ('izes', 'ises')]
for old, new in uk_spelling_subs:
    vop['English prefered label'] = vop['English prefered label'].str.replace(old, new, regex=True)
# 3) Lowercasing
vop['English alternate labels'] = vop['English alternate labels'].str.lower()
vop['English prefered label'] = vop['English prefered label'].str.lower()
# 4) create a dictionary from prefered and alternate labels
vop_dict = dict(zip(vop["English alternate labels"], vop["English prefered label"]))
# 3.2 Alternative dictionary
weird_dico = {"-": " "}  # replace hyphens with spaces ahead of dictionary substitution
vop2 = pd.read_excel(r'C:\Users\Gildas\OneDrive\OECD\Datasets\Terms vocabulary.xlsx', sheet_name='Bigrams', header=0, usecols=[0,1])
vop2 = vop2.dropna()
vop2_dict = dict(zip(vop2["WORDS_OTHER"], vop2["WORDS_MAIN"]))
# 0) Ensure all the data is in the string format:
# 0) Filter out a few problematic initiatives and ensure the corpus field is in string format:
compass_df_TR = compass_df.loc[~compass_df['InitiativeID'].isin([99994505, 99994074, 99994495, 99994504, 15681])].copy()
compass_df_reduced = compass_df_TR[['IIDYear', 'Objectives']].copy()
compass_df_reduced['Objectives'] = compass_df_reduced['Objectives'].astype(str, errors='ignore')
# 1) Lowercasing
compass_df_reduced['Objectives'] = compass_df_reduced['Objectives'].str.lower()
# 2) english_uniformisation (here: towards British English)
uk_spelling_subs = [('ization', 'isation'), ('izing', 'ising'), ('izement', 'isement'),
                    ('izer', 'iser'), ('izable', 'isable'), ('yzi', 'ysi'),
                    ('start ups', 'start_ups'), ('programme', 'program'),
                    ('yze', 'yse'), ('izes', 'ises')]
for old, new in uk_spelling_subs:
    compass_df_reduced['Objectives'] = compass_df_reduced['Objectives'].str.replace(old, new, regex=True)
# 3) Removal of non-alphanumeric characters (URLs, tags, etc)
# 3.1 remove non ASCII characters
compass_df_reduced['Objectives'] = compass_df_reduced['Objectives'].str.encode('ascii', 'ignore').str.decode('ascii')
# 3.2 apply patterns: remove words of three characters or fewer, then standalone numbers
pattern2 = r'\b[A-Za-z0-9]{1,3}\b'
pattern3 = r'\b[0-9(\-)]{1,}\b'
compass_df_reduced['Objectives'] = compass_df_reduced['Objectives'].apply(lambda x: re.sub(pattern2, '', x))
compass_df_reduced['Objectives'] = compass_df_reduced['Objectives'].apply(lambda x: re.sub(pattern3, '', x))
# 4) Removal Extra Whitespaces
def remove_whitespace(text):
return " ".join(text.split())
compass_df_reduced['Objectives'] = compass_df_reduced['Objectives'].apply(lambda z:remove_whitespace(z))
# 5) apply weird dico
compass_df_reduced['Objectives'] = compass_df_reduced['Objectives'].replace(weird_dico, regex=True)
# 6) apply vop dico
compass_df_reduced['Objectives'] = compass_df_reduced['Objectives'].replace(vop_dict, regex=True)
# 7) apply vop2 dico
compass_df_reduced['Objectives'] = compass_df_reduced['Objectives'].replace(vop2_dict, regex=True)
# 1) Convert the dataframe into list of lists of words
def sent_to_words(sentences):
for sentence in sentences:
yield(gensim.utils.simple_preprocess(str(sentence), deacc=True, min_len=3, max_len=150))
data = compass_df_reduced['Objectives'].values.tolist()
data_words = list(sent_to_words(data))
# 2) Remove Stop Words
def remove_stopwords(texts):
return [[word for word in gensim.utils.simple_preprocess(str(doc), deacc=True, min_len=3, max_len=150) if word not in stopwords] for doc in texts]
stopwords = spacy.load('en_core_web_sm').Defaults.stop_words
stopwords.update(['wallonia','american', 'euro', 'Česká', 'czech republic', 'republic', 'israeli', 'zealand', 'czech', 'finland', 'nordic', 'russian', 'russia', 'poland', 'polish', 'luxembourg', 'german', 'germany', 'italy', 'italian', 'spanish', 'portuguese', 'portugal', 'spain', 'france', 'french', 'australia', 'australian', 'british', 'irish', 'ireland', 'danish', 'denmark', 'croatia', 'croatian', 'austrian', 'austria', 'swedish', 'sweden', 'norwegian', 'norway', 'greek', 'greece', 'turkish', 'baltic', 'lithuania', 'lithuanian', 'estonia', 'estonian', 'latvia', 'latvian', 'hungary', 'hungarian', 'slovakian', 'slovakia', 'japan', 'japanese', 'korea', 'korean', 'chile', 'chilean', 'argentina', 'argentine', 'brasilian', 'brazil', 'brazilian', 'mexican', 'mexico', 'costa-rica', 'costa-rican', 'switzerland', 'swiss', 'netherlands', 'dutch', 'netherland', 'belgium', 'belgian', 'flemish', 'colombian', 'colombia', 'thailand', 'cyprus', 'canada', 'canadian', 'peru', 'peruvian', 'slovenian', 'slovenia', 'iceland', 'icelandic', 'romania', 'romanian', 'morocco', 'moroccan', 'israel', 'bulgarian', 'bulgaria', 'china', 'chinese', 'vietnamese', 'india','indian', 'saudi', 'serbian', 'serbia', 'egypt', 'egyptian', 'uruguay', 'uruguayan', 'indonesia', 'indonesian', 'kenya', 'kenyan', 'european', 'brussels', 'leuven', 'santander', 'walloon', 'costa', 'rican', 'canadas', 'oecd', 'kazakhstan', 'african', 'federation', 'vlaanderen', 'malaysian', 'malaysia', 'montenegro', 'cypriot', 'kazakh', 'europe', 'italy', 'italys', 'croatia', 'thai', 'malta', 'africas', 'italia', 'northern', 'southern', 'eastern', 'western', 'east', 'west', 'south', 'north', 'shanghai', 'munich', 'new delhi', 'seoul', 'boston', 'tel aviv', 'moscow', 'finnish', 'ukraine', 'ukrainian','germanys','world','hellenic','turkeys'])
stopwords.update(['recently','hour', 'today', 'tomorrow', 'yesterday', 'later', 'multiannual', 'annual', 'number', 'period', 'january', 'february', 'march', 'april', 'may', 'june', 'july', 'august', 'september', 'october', 'november', 'december', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten', 'eleven', 'twelve', 'twelve', 'day', 'daily', 'time', 'timely', 'monthly', 'month', 'annual', 'year', 'general', 'actual', 'or', 'and', 'last', 'other', 'half', 'well', 'thousand'])
stopwords.update(['partially', 'majority', 'mainly', 'currently', 'additionally', 'specifically', 'shall', 'manner', 'timespan', 'objectively', 'worth', 'certain', 'mandatory', 'example', 'likely', 'irrespective', 'instead', 'approximately', 'furthermore', 'http', 'especially', 'newly', 'previous', 'prior', 'broad', 'billion', 'million', 'previously', 'typically', 'primarily', 'hereinafter', 'significant', 'particular','great', 'bold', 'major','this','that','which','beyond','behind','between','from','to','also', 'actual', 'otherwise', 'rather', 'altogether', 'effectively', 'some', 'along', 'such', 'with', 'through', 'first', 'second', 'from', 'into', 'less', 'more', 'main', 'other', 'same', 'than', 'that', 'this', 'their', 'they', 'them', 'whole', 'while', 'whilst', 'without', 'various', 'without', 'consist', 'correspond', 'comprise'])
stopwords.update(['rtdi', 'ttgv','qbic', 'vito', 'imec', 'UGent', 'ULG', 'EFSI', 'csiro', 'sbir', 'anpcyt', 'innosuisse','mcast', 'ires', 'irap', 'ukti', 'finpro', 'fraunhofer', 'dimpt', 'mctic', 'innoviris', 'encti', 'rdec', 'sitr', 'kosgeb', 'icex', 'alibaba', 'ebay', 'etsy', 'innovfin', 'wbso', 'erdf', 'cdti', 'bpifrance', 'rda', 'tubitak','vinnovas','esif','efsi','sebrae','bmbf','sifide'])
stopwords.update(['szchenyi', 'contrato', 'prestamo', 'banco', 'interamericano', 'desarrollo', 'financiar', 'programa', 'innovacion', 'humano', 'competitividad'])
stopwords.update(stop_words_acronym)
data_words = remove_stopwords(data_words)
# 3) Lemmatise to keep only allowed word types
# Initialise spaCy 'en' model, keeping only the tagger component (for efficiency)
def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV', 'AUX', 'X', 'PROPN', 'SPACE']):
# """https://spacy.io/api/annotation"""
# remove 'DET', 'ADP', 'NUM', 'SYM', 'PRON', 'CONJ', 'CCONJ', 'PUNCT', 'SCONJ', 'INTJ', 'PART',
texts_out = []
for sent in texts:
doc = nlp(" ".join(sent))
texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
return texts_out
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner']) # alternative: "en_core_web_md"
data_lemmatized2 = lemmatization(data_words)
This section summarises the policies reported under the Innovation in firms and innovative entrepreneurship policy area, one of the six core areas covered by the STIP Compass dataset. It includes:
Figure 1 displays the evolution of initiatives dedicated to the Innovation in firms and innovative entrepreneurship policy area, for each of the selected countries and by survey edition.
In the three editions of the survey, participating countries have reported a growing number of initiatives in favour of innovation in firms and innovative entrepreneurship. On average, OECD countries report more initiatives than non-OECD countries, although the gap in reporting has narrowed in the more recent editions.
All the selected countries have reported a higher number of initiatives than the OECD median, except Italy (by only a small margin). Australia, Canada, Germany and Korea have increased their reporting of initiatives well above the OECD median in 2021. Germany sharply increased its reporting, with 140% more initiatives in 2021 than in 2019. Australia and the United Kingdom reported a stable number of initiatives, although higher than the OECD median.
This increase in reported initiatives, and the varied dynamics across countries, may reflect a clear policy focus on this area, but could also be explained by changes in reporting style. We account for the latter possibility in the analysis (Section 5.3).
#retrieve a copy of the main dataset
compass_df0 = compass_df.reset_index().copy()
# count initiatives by country by survey
compass_df0['INN'] = 1
compass_df0 = pd.pivot_table(data=compass_df0, index=['SurveyYear','OECD_STATUS','CountryLabel'], values=['INN'], aggfunc={'INN': pd.Series.sum}, fill_value=0).reset_index()
# Pivot the dataframe to wide format, and generate a second dataframe for OECD and non-OECD countries medians
compass_df01 = pd.pivot_table(data=compass_df0, index=['SurveyYear'], columns=['CountryLabel'], values=['INN'], aggfunc={'INN': pd.Series.median}, fill_value=0).reset_index()
compass_df02 = pd.pivot_table(data=compass_df0, index=['SurveyYear'], columns=['OECD_STATUS'], values=['INN'], aggfunc={'INN': pd.Series.median}, fill_value=0).reset_index()
compass_df01.columns = compass_df01.columns.droplevel(-2)
compass_df02.columns = compass_df02.columns.droplevel(-2)
compass_df01 = compass_df01.rename(columns = {'': 'Survey year'})
compass_df02 = compass_df02.rename(columns = {'': 'Survey year'})
# Import the selected countries and add the variable to the country dataframe
my_countries_list = my_countries[['CountryLabel']].copy()
my_countries_list["retain"] = 1
my_countries_list = my_countries_list.T
new_header = my_countries_list.iloc[-2]
my_countries_list = my_countries_list[1:].copy()
my_countries_list.columns = new_header
compass_df01 = pd.concat([compass_df01, my_countries_list])  # DataFrame.append was removed in pandas 2.0
# remove non selected countries
compass_df01 = compass_df01.dropna(axis=1)
compass_df01.drop(['retain'], axis=0, inplace=True)
# merge the two dataframes (selected countries and OECD/non-OECD medians)
compass_df01 = pd.DataFrame(compass_df01)
compass_df02 = pd.DataFrame(compass_df02)
compass_df01 = compass_df01.reset_index()
compass_df02 = compass_df02.reset_index()
compass_df00 = pd.merge(compass_df01,
compass_df02,
on ='index',
how ='inner')
compass_df00.drop(['index'], axis=1, inplace=True)
# reshape the dataframe, drop unnecessary columns and update dates
compass_df00 = compass_df00.T.copy()
compass_df00 = compass_df00.rename(columns = {0: 2017, 1: 2019, 2: 2021})
compass_df00.drop(['Survey year'], axis=0, inplace=True)
# generate the graph:
import plotly.graph_objects as go
countries=compass_df00.index
colors = {'Null': '#ffffff','A':'#594d45','B':'#79533d','C':'#595142','D':'#745d46','E':'#ac7e54','F':'#a9947b','G':'#b49d80','H':'#bcab90','I':'#535640','J':'#54504b','K':'#63613e','L':'#4a5444','M':'#5c5c5b','N':'#9495a5','O':'#373538'}
fig = go.Figure(data=[
go.Bar(name='2017', x=countries, y=compass_df00[2017], marker_color=colors['G']),
go.Bar(name='2019', x=countries, y=compass_df00[2019], marker_color=colors['D']),
go.Bar(name='2021', x=countries, y=compass_df00[2021], marker_color=colors['A'])
])
# Change the bar mode
fig.update_layout(
barmode='group',
title="Figure 1: Evolution of initiatives dedicated to Innovation in firms and innovative entrepreneurship policy area by selected countries and by survey",
margin=dict(l=140, r=40, b=50, t=100),
legend=dict(
font_size=10,
y=0.5,
yanchor='middle',
xanchor='left',
font=dict(color="black"),
),
width=800,
height=400,
paper_bgcolor='white',
plot_bgcolor='white',
hovermode='closest',
)
#print the graph
#fig.write_html(r"C:\Users\Gildas\OneDrive\OECD\Presentation_graphs\fig1.html")
fig.write_image(r"C:\Users\Gildas\OneDrive\OECD\Presentation_graphs\fig1.png")
fig.show()
Table 1 displays the total words in the pre-processed corpus for OECD countries across each survey edition. The total size of the pre-processed corpus increased faster than the reported number of initiatives. While the number of OECD countries' initiatives increased by 27% between 2017 and 2021, the size of the corresponding corpus increased by over 68% over the same period.
Table 2 displays the total words forming the pre-processed corpus for all participant countries across each survey edition and displays a similar increase.
# obtain stats on pre-processed corpus by year:
def count_word(string):
    return len(string.strip().split(" "))

compass_df5 = compass_df.reset_index().copy()
compass_df5['Count_word'] = compass_df5['Objectives'].apply(count_word)
compass_df5 = compass_df5[(compass_df5.OECD_STATUS == "OECD country")]
compass_df5 = pd.pivot_table(data=compass_df5, index=['SurveyYear','CountryLabel'], values=['Count_word','InitiativeID'], aggfunc={'Count_word': pd.Series.sum,'InitiativeID': pd.Series.count}, fill_value=0).reset_index()
# obtain stats on pre-processed corpus by year:
def count_word(string):
    return len(string.strip().split(" "))
compass_df2 = compass_df.reset_index().copy()
compass_df2['Count_word'] = compass_df2['Objectives'].apply(lambda z:count_word(z))
compass_df2 = compass_df2[(compass_df2.OECD_STATUS == "OECD country")]
compass_df2 = pd.pivot_table(data=compass_df2, index=['SurveyYear'], values=['Count_word','InitiativeID','CountryLabel'], aggfunc={'Count_word': pd.Series.sum,'InitiativeID': pd.Series.count,'CountryLabel': pd.Series.nunique}, fill_value=0).reset_index()
compass_df2 = compass_df2.rename(columns = {'SurveyYear': 'Survey year','Count_word': 'Total words in corpus', 'CountryLabel':'Participant countries','InitiativeID':'Total initiatives'})
# Generate a table
import plotly.figure_factory as ff
colorscale = [[0, '#272D31'],[.5, '#ffffff'],[1, '#ffffff']]
fig = ff.create_table(compass_df2, colorscale=colorscale)
fig.layout.width=800
fig.update_layout(title_text="Table 1: Description of corpus by survey (OECD countries only)",
title_font_family="verdana",
title_font_color="#000000",
title_font_size=15)
fig.update_layout({'margin':{'t':35}})
fig.write_image(r"C:\Users\Gildas\OneDrive\OECD\Presentation_graphs\tab1.png")
fig.show()
# obtain stats on pre-processed corpus by year:
def count_word(string):
    return len(string.strip().split(" "))
compass_df2 = compass_df.reset_index().copy()
compass_df2['Count_word'] = compass_df2['Objectives'].apply(lambda z:count_word(z))
#compass_df2 = compass_df2[(compass_df2.OECD_STATUS == "OECD country")]
compass_df2 = pd.pivot_table(data=compass_df2, index=['SurveyYear'], values=['Count_word','InitiativeID','CountryLabel'], aggfunc={'Count_word': pd.Series.sum,'InitiativeID': pd.Series.count,'CountryLabel': pd.Series.nunique}, fill_value=0).reset_index()
compass_df2 = compass_df2.rename(columns = {'SurveyYear': 'Survey year','Count_word': 'Total words in corpus', 'CountryLabel':'Participant countries','InitiativeID':'Total initiatives'})
# Generate a table
import plotly.figure_factory as ff
colorscale = [[0, '#272D31'],[.5, '#ffffff'],[1, '#ffffff']]
fig = ff.create_table(compass_df2, colorscale=colorscale)
fig.layout.width=800
fig.update_layout(title_text="Table 2: Description of corpus by survey (All participant countries)",
title_font_family="verdana",
title_font_color="#000000",
title_font_size=15)
fig.update_layout({'margin':{'t':35}})
fig.write_image(r"C:\Users\Gildas\OneDrive\OECD\Presentation_graphs\tab2.png")
fig.show()
Figure 2 displays the average number of words countries use to describe initiatives and to detail their objectives.
There is a clear cohesion across the corpora of the selected countries, with no country significantly over- or under-populating the objectives and short description fields.
# obtain stats on pre-processed corpus by year:
# define function count_word:
def count_word(string):
    return len(string.strip().split(" "))
# obtain compass_df6 which sums corpus by country and survey year
compass_df6 = compass_df.reset_index().copy()
compass_df6['Count_word'] = compass_df6['Objectives'].apply(lambda z:count_word(z))
compass_df6 = pd.pivot_table(data=compass_df6, index=['SurveyYear','OECD_STATUS','CountryLabel'], values=['Count_word','InitiativeID'], aggfunc={'Count_word': pd.Series.sum,'InitiativeID': pd.Series.count}, fill_value=0).reset_index()
compass_df6["corpus"] = compass_df6["Count_word"].astype(float) / compass_df6["InitiativeID"].astype(float)
compass_df6["corpus"] = round(compass_df6["corpus"],0)
# compute the OECD median and append it as extra rows
compass_df7 = compass_df6[(compass_df6.OECD_STATUS == "OECD country")].copy()
compass_df7 = pd.pivot_table(data=compass_df7, index=['SurveyYear','OECD_STATUS'], values=['Count_word','InitiativeID','corpus'], aggfunc={'Count_word': pd.Series.median,'InitiativeID': pd.Series.median,'corpus': pd.Series.median}, fill_value=0).reset_index()
new_row0 = {'SurveyYear': compass_df7['SurveyYear'][0],'OECD_STATUS': compass_df7['OECD_STATUS'][0], 'CountryLabel':'OECD Median', 'Count_word': compass_df7['Count_word'][0], 'InitiativeID': compass_df7['InitiativeID'][0], 'corpus': compass_df7['corpus'][0]}
new_row1 = {'SurveyYear': compass_df7['SurveyYear'][1],'OECD_STATUS': compass_df7['OECD_STATUS'][1], 'CountryLabel':'OECD Median', 'Count_word': compass_df7['Count_word'][1], 'InitiativeID': compass_df7['InitiativeID'][1], 'corpus': compass_df7['corpus'][1]}
new_row2 = {'SurveyYear': compass_df7['SurveyYear'][2],'OECD_STATUS': compass_df7['OECD_STATUS'][2], 'CountryLabel':'OECD Median', 'Count_word': compass_df7['Count_word'][2], 'InitiativeID': compass_df7['InitiativeID'][2], 'corpus': compass_df7['corpus'][2]}
# DataFrame.append was removed in pandas 2.0; concatenate the median rows instead
compass_df6 = pd.concat([compass_df6, pd.DataFrame([new_row0, new_row1, new_row2])], ignore_index=True)
# filter on selected countries only
compass_df6 = pd.merge(compass_df6,
my_countries,
on ='CountryLabel',
how ='inner')
compass_df6 = compass_df6.drop(['CountryCode','Aggregate'], axis=1)
# obtain stats on pre-processed corpus by year:
# define function count_word:
def count_word(string):
    return len(string.strip().split(" "))
# order the data (count words in compass_df61, #initiativesID in compass_df62)
compass_df61 = compass_df6.pivot(index='SurveyYear', columns='CountryLabel', values='Count_word')
compass_df61 = compass_df61.reset_index()
compass_df61.columns = compass_df61.columns.to_flat_index()
compass_df61 = compass_df61.drop(['SurveyYear'], axis=1)
compass_df62 = compass_df6.pivot(index='SurveyYear', columns='CountryLabel', values='InitiativeID')
compass_df62 = compass_df62.reset_index()
compass_df62.columns = compass_df62.columns.to_flat_index()
compass_df62 = compass_df62.drop(['SurveyYear'], axis=1)
compass_df63 = compass_df61/compass_df62
compass_df63 = round(compass_df63,1)
#sort the dataframe
compass_df63 = compass_df63.sort_values([2], axis=1).copy()
# generate graph
import plotly.graph_objects as go
country = compass_df63.columns[0:] # compass_df6['CountryLabel'].drop_duplicates()
Count_word_2017 = compass_df63.iloc[[0]].T
Count_word_2017 = Count_word_2017.reset_index()
Count_word_2017.rename(columns={0: 'Count_word'}, inplace=True)
Count_word_2017 = list(Count_word_2017["Count_word"])
Count_word_2019 = compass_df63.iloc[[1]].T
Count_word_2019 = Count_word_2019.reset_index()
Count_word_2019.rename(columns={1: 'Count_word'}, inplace=True)
Count_word_2019 = list(Count_word_2019["Count_word"])
Count_word_2021 = compass_df63.iloc[[2]].T
Count_word_2021 = Count_word_2021.reset_index()
Count_word_2021.rename(columns={2: 'Count_word'}, inplace=True)
Count_word_2021 = list(Count_word_2021["Count_word"])
InitiativeID_2017 = compass_df63.iloc[[0]]
InitiativeID_2019 = compass_df63.iloc[[1]]
InitiativeID_2021 = compass_df63.iloc[[2]]
colors = {'Null': '#ffffff','A':'#594d45','B':'#79533d','C':'#595142','D':'#745d46','E':'#ac7e54','F':'#a9947b','G':'#b49d80','H':'#bcab90','I':'#535640','J':'#54504b','K':'#63613e','L':'#4a5444','M':'#5c5c5b','N':'#9495a5','O':'#373538'}
fig = go.Figure()
fig.add_trace(go.Scatter(
x=Count_word_2017,
y=country,
name='2017 Corpus',
marker=dict(
color=colors['G'],
line_color="#000000",
)
))
#fig.add_trace(go.Scatter(
# x=Count_word_2019, y=country,
# name='2019 Corpus',
# marker=dict(
# color=colors['D'],
# line_color="#000000",
# )
#))
fig.add_trace(go.Scatter(
x=Count_word_2021, y=country,
name='2021 Corpus',
marker=dict(
color=colors['A'],
line_color="#000000",
)
))
fig.update_traces(mode='markers', marker=dict(line_width=1, symbol='circle', size=16))
fig.update_layout(
title="Figure 2: Average number of words per initiative by selected countries",
xaxis=dict(
showgrid=False,
showline=True,
linecolor="#000000",
tickfont_color="#000000",
showticklabels=True,
dtick=10,
ticks='outside',
tickcolor="#000000",
),
margin=dict(l=140, r=40, b=50, t=100),
legend=dict(
font_size=10,
y=0.5,
yanchor='middle',
xanchor='left',
font=dict(color="black"),
),
width=800,
height=400,
paper_bgcolor='white',
plot_bgcolor='white',
hovermode='closest',
)
fig.update_xaxes(
title_text='Average number of words per initiative',
title_font_family="verdana",
title_font_size=17,
title_font_color="#000000",
fixedrange=True,
range=(40, 100),
#ticktext=["", "Topic less prevalent", "OECD Median", "Topic more prevalent"],
tickvals=[0, 60, 90],
showgrid=True,
gridcolor="#444",
# gridfont=dict(family="Open Sans"),
ticks="outside",
tickson="boundaries",
tickcolor="#444",
tickfont=dict(family="verdana", color="#000000", size=13)
)
#print the graph
#fig.write_html(r"C:\Users\Gildas\OneDrive\OECD\Presentation_graphs\fig2.html")
fig.write_image(r"C:\Users\Gildas\OneDrive\OECD\Presentation_graphs\fig2.png")
fig.show()
Having established that countries use an increasingly similar number of words to describe an increasing pool of initiatives, we now look at the variety of words countries use in their reporting. To this end, we introduce the notion of relative lexical diversity.
Figure 3 displays the lexical diversity score for countries participating in the STIP Compass dataset relative to the total corpus. In practice, the score is obtained by calculating the ratio of (i) the number of unique words in a country's corpus; over (ii) the number of unique words used in the total corpus.
Countries whose official language is English tend to have a lexical diversity score above the OECD median (with the exception of New Zealand), but several countries that do not use English score higher still. We can therefore reasonably assume that comparing corpora across countries that do or do not officially use English does not introduce a bias. We also note that the selected countries show a higher lexical diversity than the OECD median, with the exception of Italy (which is slightly below it).
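The score itself can be sketched in a few lines. The function and toy corpora below are illustrative only (the actual computation on the STIP corpus is performed in the following cell):

```python
# Illustrative sketch of the relative lexical diversity score: the number of
# unique words in a country's corpus divided by the number of unique words in
# the total corpus. Corpora are plain space-separated strings, as in the
# pre-processed data. The example strings are hypothetical.
def relative_lexical_diversity(country_corpus: str, total_corpus: str) -> float:
    country_vocab = set(country_corpus.split())
    total_vocab = set(total_corpus.split())
    return len(country_vocab) / len(total_vocab)

total = "support innovation firm startup research grant loan tax credit"
country = "support innovation grant grant loan"
score = relative_lexical_diversity(country, total)  # 4 unique words out of 9
```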
### Here we use the pre-processed data - we must first incorporate the transformed data back into the main dataset before applying the lexical diversity procedure
# 0. aggregate pre-processed objectives by country
# 1. compute the lexical diversity for the whole dataset
# 2. compute the lexical diversity for each country
# copy main dataset
compass_df["Objectives_preprocessed"] = compass_df_reduced["Objectives"]
compass_df2 = compass_df.reset_index().copy()
# copy pre-processed data in dataframe tests
tests = pd.DataFrame(data_words)
tests['Objectives_preprocessed'] = tests[tests.columns[0:]].apply(lambda x: ', '.join(x.dropna().astype(str)), axis=1)
tests = tests['Objectives_preprocessed'].reset_index()
# merge these two datasets
compass_df_reduced3 = pd.merge(compass_df2,
tests,
on ='index',
how ='inner')
compass_df_reduced3 = compass_df_reduced3.drop(['index', ], axis=1)
compass_df_reduced3 = compass_df_reduced3.fillna(0).copy()
# obtain a single string for all countries' initiatives' objectives in dataframe compass_df_reduced4:
compass_df_reduced4 = compass_df_reduced3[['Objectives_preprocessed_y']].copy()
compass_df_reduced4 = compass_df_reduced4.T
compass_df_reduced4['Objectives_preprocessed'] = compass_df_reduced4[compass_df_reduced4.columns[1:]].apply(lambda x: ', '.join(x.dropna().astype(str)), axis=1)
compass_df_reduced4 = compass_df_reduced4['Objectives_preprocessed'].reset_index()
# obtain a single string for each country's initiatives' objectives in dataframe compass_df_reduced5:
compass_df_reduced5 = compass_df_reduced3[['Objectives_preprocessed_y', 'CountryCode']].reset_index().copy()
compass_df_reduced5=compass_df_reduced5.pivot(index='CountryCode', columns='index', values='Objectives_preprocessed_y')
compass_df_reduced5['Objectives_preprocessed'] = compass_df_reduced5[compass_df_reduced5.columns[1:]].apply(lambda x: ', '.join(x.dropna().astype(str)), axis=1)
compass_df_reduced5 = compass_df_reduced5['Objectives_preprocessed'].reset_index().copy()
## 3 generate statistics based on pre-processed objectives (compass_df_reduced3):
def count_word(string):
    return len(string.strip().split(" "))
compass_df_reduced3['Count_word'] = compass_df_reduced3['Objectives_preprocessed_y'].apply(lambda z:count_word(z))
compass_df_stats = pd.pivot_table(data=compass_df_reduced3, index=['OECD_STATUS', 'CountryCode', 'CountryLabel'], values=['InitiativeID', 'Count_word'], aggfunc={'InitiativeID': pd.Series.count, 'Count_word': pd.Series.mean}, margins=False, margins_name='total', fill_value=0).reset_index()
compass_df_stats['corpus'] = compass_df_stats['InitiativeID'] * compass_df_stats['Count_word']
compass_df_stats['InitiativeID'] = compass_df_stats['InitiativeID'].astype(int)
compass_df_stats['corpus'] = compass_df_stats['corpus'].astype(int)
compass_df_stats['Count_word'] = (round(compass_df_stats['Count_word'],1).astype(float))
# merge the stats obtained with the concatenated objectives from compass_df_reduced5
compass_df_stats = pd.merge(compass_df_stats,
compass_df_reduced5,
on ='CountryCode',
how ='inner')
compass_df_stats = compass_df_stats.fillna(0).copy()
import nltk
import nltk.corpus
def lexical_diversity(text):
    # ratio of unique words in this text to unique words in the total corpus
    text = text.split(" ")
    return len(set(text)) / len(set(compass_df_reduced4['Objectives_preprocessed'][0].split(" ")))
compass_df_stats['diversity'] = round(compass_df_stats['Objectives_preprocessed'].apply(lambda z:lexical_diversity(z)),3)
# generate data for OECD median
# Compute median in an extra row and append the row to dataframe
compass_df_stats_OECD = compass_df_stats[(compass_df_stats.OECD_STATUS == "OECD country")]
# OECD MEdian
new_row2 = {'OECD_STATUS':'OECD country', 'CountryLabel':'OECD Median', 'CountryCode':'OECD', 'Count_word': compass_df_stats_OECD['Count_word'].median(), 'InitiativeID': compass_df_stats_OECD['InitiativeID'].median(), 'corpus': compass_df_stats_OECD['corpus'].median(), 'Objectives_preprocessed': "",'diversity': compass_df_stats_OECD['diversity'].median()}
compass_df_stats = pd.concat([compass_df_stats, pd.DataFrame([new_row2])], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
# select only the label of selected countries
compass_df_stats['CountryLabel2'] = compass_df_stats['CountryLabel'].isin(list(my_countries.CountryLabel))
compass_df_stats.loc[compass_df_stats['CountryLabel2'] == True, 'CountryLabel2'] = compass_df_stats['CountryLabel']
compass_df_stats.loc[compass_df_stats['CountryLabel2'] == False, 'CountryLabel2'] = ""
# sort the dataset and remove non OECD countries (+ Brazil)
compass_df_stats = compass_df_stats.sort_values(["diversity"]).copy()
compass_df_stats = compass_df_stats.loc[(compass_df_stats["OECD_STATUS"] == 'OECD country') | (compass_df_stats["CountryLabel"] == 'Brazil')]
# generate the graph
import plotly.graph_objects as go
country = compass_df_stats['CountryLabel']
colors = {'Null': '#ffffff','A':'#594d45','B':'#79533d','C':'#595142','D':'#745d46','E':'#ac7e54','F':'#a9947b','G':'#b49d80','H':'#bcab90','I':'#535640','J':'#54504b','K':'#63613e','L':'#4a5444','M':'#5c5c5b','N':'#9495a5','O':'#373538','Z':'#000000'}
# map other countries, selected countries and the OECD median to their colours
compass_df_stats.loc[(compass_df_stats['CountryLabel2'] == ''), 'CountryLabel2'] = colors['G']
compass_df_stats.loc[(compass_df_stats['CountryLabel2'] != '#b49d80'), 'CountryLabel2'] = colors['A']
compass_df_stats.loc[(compass_df_stats['CountryLabel'] == 'OECD Median'), 'CountryLabel2'] = colors['Null']
country2 = compass_df_stats['CountryLabel2']
diversity = compass_df_stats["diversity"]
fig = go.Figure()
fig.add_trace(go.Scatter(
x=country,
y=diversity,
name='Lexical diversity',
marker=dict(
color=country2,
line_color=colors['O'],
)
))
fig.update_traces(mode='markers', marker=dict(line_width=1, symbol='circle', size=16))
fig.update_layout(
title="Figure 3: Lexical diversity of total corpus by country",
yaxis=dict(
showgrid=True,
showline=True,
linecolor="#000000",
tickfont_color="#000000",
showticklabels=True,
tickformat=".0%",
dtick=0.1,
ticks='outside',
tickcolor="#000000",
),
xaxis=dict(
showgrid=True,
showline=True,
linecolor="#000000",
tickfont_color="#000000",
showticklabels=True,
tickcolor="#000000",
),
margin=dict(l=140, r=40, b=50, t=100),
legend=dict(
font_size=10,
y=0.5,
yanchor='middle',
xanchor='left',
font=dict(color="black"),
),
width=800,
height=400,
paper_bgcolor='white',
plot_bgcolor='white',
hovermode='closest',
)
fig.update_yaxes(
title_text='Lexical diversity',
title_font_family="verdana",
title_font_size=17,
title_font_color="#000000",
fixedrange=True,
range=(0, 0.31),
tickvals=[0, 0.1, 0.2, 0.3],
showgrid=True,
gridcolor="#000000",
# gridfont=dict(family="Open Sans"),
ticks="outside",
tickson="boundaries",
tickcolor="#000000",
tickfont=dict(family="verdana", color="#000000", size=13)
)
fig.update_xaxes(
fixedrange=True,
showgrid=True,
ticks="outside",
tickson="boundaries",
tickcolor="#000000",
tickfont=dict(family="verdana", color="#000000", size=11),
tickangle=-45
)
#print the graph
#fig.write_html(r"C:\Users\Gildas\OneDrive\OECD\Presentation_graphs\fig3.html")
fig.write_image(r"C:\Users\Gildas\OneDrive\OECD\Presentation_graphs\fig3.png")
fig.show()
This section provides the backbone for the computerised semantic analysis. In particular, it describes how the Latent Dirichlet Allocation (LDA) method can be used to identify topics and assess how countries emphasise such topics in the policy data they enter across different survey editions.
Latent Dirichlet Allocation (LDA) is a form of unsupervised machine learning commonly used for topic modelling in natural language processing. It estimates a model that explains a set of documents in terms of unobserved topics; in other words, it classifies the documents under such topics. The input to the model is a corpus of individual documents, each made of a series of words (here: initiatives’ objectives). The LDA model uses the corpus as a training set to identify a given number of topics. In particular, from the composition and association of words in each individual document, the model infers a list of topics, each defined by a series of individual words that are frequently observed together and form an assumed coherent association. LDA is unsupervised because we do not need to supply labels to the topic modelling algorithm for it to identify the topics.
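To make the model's input concrete, each document is reduced to a bag of words before training. The minimal sketch below is plain Python mirroring the behaviour of gensim's `Dictionary.doc2bow` (the toy documents are hypothetical); it turns documents into lists of `(token_id, count)` pairs:

```python
from collections import Counter

# toy documents (hypothetical); in the pilot these are initiatives' objectives
docs = [["support", "innovation", "support"], ["grant", "innovation"]]

# assign a stable integer id to every token in the vocabulary
vocab = {tok: i for i, tok in enumerate(sorted({t for d in docs for t in d}))}

def doc2bow(doc):
    # map each token to its id and count occurrences, as gensim's doc2bow does
    counts = Counter(vocab[t] for t in doc)
    return sorted(counts.items())

bow_corpus = [doc2bow(d) for d in docs]
```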
Once estimated, the LDA model can be applied to score each individual document against each topic it inferred. In practice, the model identifies the presence of certain words or associations of words and scores an initiative according to the words that make up each topic. In this pilot, the model used the whole corpus, but it is possible to fit the model on a portion of the corpus and test it on the remainder to assess the model's strength. For a more thorough introduction to LDA, see for example this article.
The number of topics is an input to the LDA model. For a given set of documents, there is no single a priori number of topics; rather, the algorithm fits the model against a target number of topics. A common approach to choosing the most appropriate number of topics is to maximise a measure of coherence (Roeder M., Both A. & Hinneburg A., 2015). In practice, coherence tends to increase mechanically with the number of topics, so optimising coherence may correspond to identifying an inflexion point that also yields interpretable topics. Additionally, the LDA model can be fine-tuned through two hyper-parameters: alpha, which governs the expected document-topic density, and beta, which governs the topic-word density.
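As an illustration of the inflexion-point idea, one simple heuristic (a sketch for this write-up, not the selection procedure used in the pilot) is to stop increasing the number of topics once the marginal coherence gain falls below a threshold:

```python
# Pick the number of topics at the "elbow": the last k before marginal
# coherence gains become negligible. The scores below are hypothetical.
def pick_num_topics(scores: dict, min_gain: float = 0.005) -> int:
    ks = sorted(scores)
    best = ks[0]
    for prev, k in zip(ks, ks[1:]):
        if scores[k] - scores[prev] < min_gain:
            break  # coherence has plateaued
        best = k
    return best

coherence_by_k = {5: 0.41, 10: 0.46, 15: 0.50, 17: 0.52, 20: 0.521, 25: 0.522}
best_k = pick_num_topics(coherence_by_k)  # 17: gains flatten beyond this point
```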
For the number of topics N defined in input, the LDA topic model provides, for each topic, a list of words and their weightings.
The PyLDAvis tool provides a visualisation that maps the N topics, where each topic is drawn proportionally to its prevalence in the corpus. Hovering over a given topic reveals its word cluster and the weight each of these words carries in the topic (compared with the frequency of the word in the full corpus). Topics that overlap in the map are related, as they share a number of common keywords.
From these elements, the analyst must interpret a label for each topic. Topics that are not useful to summarise the data can also be excluded from the analysis.
As mentioned earlier, higher coherence does not always guarantee that topics will be made of an interpretable list of words. It is up to the analyst to interpret the lists of words and their weights appropriately or, failing that, to select another model with alternative parameters (number of topics, alpha and beta). This was also the case in this pilot, which required testing several parameter combinations before examining the resulting topics and how the model used them to classify initiatives. Furthermore, the small size of the corpus makes the interpretation stage unstable: associations between words and topics are likely highly sensitive to small changes in the data. This means that we can expect topics to vary once a new edition of the database (scheduled for 2023) is included in the analysis. It is worth noting that while this affects the topic identification stage, it is unrelated to how the model performs in classifying initiatives.
# 5.1) LDA pre-requisites:
# Create Dictionary
#id2word = corpora.Dictionary(data_lemmatized2)
id2word = gensim.corpora.dictionary.Dictionary(data_words)
id2word.filter_extremes(no_below=5, no_above=0.75, keep_n=80000)
# A dictionary is a mapping of word ids to words. To create our dictionary, we instantiate a gensim.corpora.Dictionary object. From there, the filter_extremes() method is essential to ensure a desirable frequency and representation of tokens in the dictionary.
# Create Corpus
texts = data_lemmatized2
# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]
## Optional: view
#
#print(corpus[:1])
#id2word[0]
## Human readable format of corpus (term-frequency)
#[[(id2word[id], freq) for id, freq in cp] for cp in corpus[:2]]
# 5.2) Gensim LDA - Hyperparameters fine-tuning
# supporting function
def compute_coherence_values(corpus, dictionary, k, a, b):
    lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                                id2word=id2word,
                                                num_topics=k,
                                                random_state=100,
                                                chunksize=100,
                                                passes=10,
                                                alpha=a,
                                                eta=b)
    coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized2, dictionary=id2word, coherence='c_v')
    return coherence_model_lda.get_coherence()
grid = {}
grid['Validation_Set'] = {}
# Topics range
min_topics = 5
max_topics = 30
step_size = 1
topics_range = range(min_topics, max_topics, step_size)
# Alpha parameter
alpha = list(np.arange(0.31, 0.61, 0.3))
alpha.append('symmetric')
alpha.append('asymmetric')
# Beta parameter
beta = [0.61]  # note: np.arange(0.61, 0.61, 1) returned an empty list, so the value is set explicitly
beta.append('symmetric')
# Validation sets
num_of_docs = len(corpus)
corpus_sets = [# gensim.utils.ClippedCorpus(corpus, num_of_docs*0.25),
               # gensim.utils.ClippedCorpus(corpus, num_of_docs*0.5),
               # gensim.utils.ClippedCorpus(corpus, int(num_of_docs*0.75)),
               corpus]  # the full corpus; leaving this commented out made corpus_sets empty and silently skipped the grid search
corpus_title = ['100% Corpus']  # '75% Corpus', '100% Corpus'
model_results = {'Validation_Set': [],
'Topics': [],
'Alpha': [],
'Beta': [],
'Coherence': []
}
# Can take a long time to run
if 1 == 1:
    # iterate through validation corpuses
    for i in range(len(corpus_sets)):
        # iterate through number of topics
        for k in topics_range:
            # iterate through alpha values
            for a in alpha:
                # iterate through beta values
                for b in beta:
                    # get the coherence score for the given parameters
                    cv = compute_coherence_values(corpus=corpus_sets[i], dictionary=id2word, k=k, a=a, b=b)
                    # Save the model results
                    model_results['Validation_Set'].append(corpus_title[i])
                    model_results['Topics'].append(k)
                    model_results['Alpha'].append(a)
                    model_results['Beta'].append(b)
                    model_results['Coherence'].append(cv)
    pd.DataFrame(model_results).to_csv('lda_tuning_results.csv', index=False)
#pd.DataFrame(model_results).to_csv(r'C:\Users\Gildas\OneDrive\OECD\lda_tuning_results_NEW_TEST_Objectives2-50.csv', index=False)
### 5.3) Gensim LDA - Apply the model
# select characteristics of the models:
alpha_value = 0.28
beta_value = 0.57
topics_num = 17
# 1) Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=topics_num,
                                            random_state=100,
                                            update_every=1,
                                            chunksize=100,
                                            passes=10,
                                            alpha=alpha_value,
                                            eta=beta_value,
                                            per_word_topics=True)
# 2) Compute Model Perplexity and Coherence Score
# Compute Perplexity
print('\nPerplexity: ', lda_model.log_perplexity(corpus)) # a measure of how good the model is. lower the better.
# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized2, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)
# 3) Optional: View the topics in LDA model
# Print the Keyword in the 10 topics
model_topics = lda_model.show_topics(formatted=True)
pprint(lda_model.print_topics(num_words=10))
# 4) Visualize the topics with pyLDAvis
pyLDAvis.enable_notebook()
#vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word)
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word, sort_topics=False)
vis
Each model results in a list of terms with their respective contribution to each topic. One word can contribute to several topics; the concentration of words across topics is a factor that can be altered in the modelling phase (beta). The interpretation phase is central to the topic analysis. While coherence measures how well a model fits, it does not guarantee the interpretability of topics, and the model with the highest coherence score may not yield the most interpretable topics.
In this study, we identified 7 topics with coherent bags of words which, ordered by their prevalence in the corpus, can be labelled as follows:
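To illustrate how one word can contribute to several topics, consider a hypothetical topic-to-word-weight mapping (the topic ids, words and weights below are invented, not those of the fitted model):

```python
# Invert a topic -> {word: weight} mapping to list, for a given word, every
# topic it contributes to and with which weight. Values are illustrative only.
topics = {
    0: {"innovation": 0.08, "firm": 0.05, "support": 0.04},
    1: {"research": 0.07, "innovation": 0.03},
}

def topics_for_word(word):
    return sorted((t, w[word]) for t, w in topics.items() if word in w)

shared = topics_for_word("innovation")  # contributes to both topics 0 and 1
```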
These topics are more prevalent in the corpus compared to others. They also cluster an interpretable bag of words and offer analytical power to summarise the policy objectives raised by initiatives supporting business innovation.
Conversely, some topics can be prevalent in the corpus but often capture words around generic support for business innovation (Topic 14, Topic 2) or generic national strategies and plans (Topic 13). These are not helpful to summarise the data as intended. Other topics are less prevalent in the corpus and appear to reflect spurious word co-occurrences, showing little coherence or interpretability among their key words (Topic 11, Topic 10, Topic 15, Topic 8, Topic 16, Topic 9).
The following two tables can be used to interpret topics and to check the model's robustness. They detail the dominant topic for each initiative and the most representative initiative for each topic, respectively. We use extended versions of these tables to check that policy initiatives are correctly classified under the seven topics retained from the model. Checking how many initiatives match a topic well is also a robustness check: if only a handful of initiatives contribute significantly to a given topic, this may limit its significance. One narrow topic was found to be interpretable (health care) but present in only a limited number of initiatives in the corpus (Topic 1).
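The "dominant topic" notion used in these tables reduces to a simple maximum. A minimal sketch (the `(topic_id, probability)` pairs below are hypothetical; the real pairs come from scoring a document with the fitted model):

```python
# For one document, the LDA model returns (topic_id, probability) pairs;
# the dominant topic is simply the pair with the highest probability.
doc_topics = [(3, 0.12), (7, 0.61), (12, 0.27)]

def dominant_topic(pairs):
    return max(pairs, key=lambda p: p[1])

top_id, top_prob = dominant_topic(doc_topics)  # topic 7 with probability 0.61
```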
def format_topics_sentences(ldamodel=lda_model, corpus=corpus, texts=data_words):
    # Collect one row (dominant topic, contribution, keywords) per document
    sent_topics_rows = []
    for i, row_list in enumerate(ldamodel[corpus]):
        row = row_list[0] if ldamodel.per_word_topics else row_list
        # sort the document's topics by decreasing contribution
        row = sorted(row, key=lambda x: x[1], reverse=True)
        # Get the dominant topic, its contribution and its keywords
        for j, (topic_num, prop_topic) in enumerate(row):
            if j == 0:  # => dominant topic
                wp = ldamodel.show_topic(topic_num)
                topic_keywords = ", ".join([word for word, prop in wp])
                sent_topics_rows.append([int(topic_num), round(prop_topic, 6), topic_keywords])
            else:
                break
    sent_topics_df = pd.DataFrame(sent_topics_rows,
                                  columns=['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords'])
    # Add the original text to the end of the output
    contents = pd.Series(texts)
    sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
    return sent_topics_df
df_topic_sents_keywords = format_topics_sentences(ldamodel=lda_model, corpus=corpus, texts=data_words)
# retrieve information on initiatives
df_topic_sents_keywords["CountryLabel"] = compass_df_TR["CountryLabel"]
df_topic_sents_keywords["NameEnglish"] = compass_df_TR["NameEnglish"]
# Format
df_dominant_topic = df_topic_sents_keywords.reset_index()
df_dominant_topic.columns = ['InitiativeID', 'Dominant_Topic', 'Probability', 'Topic keywords', 'Corpus', 'Country', 'Name']
df_dominant_topic = df_dominant_topic.drop(['Corpus'], axis=1)
df_dominant_topic.head()
# Keep the most representative initiative (top 1) under each topic
df_sent_topics_sorted = pd.DataFrame()
sent_topics_outdf_grpd = df_topic_sents_keywords.groupby('Dominant_Topic')
for i, grp in sent_topics_outdf_grpd:
    df_sent_topics_sorted = pd.concat([df_sent_topics_sorted,
                                       grp.sort_values(['Perc_Contribution'], ascending=False).head(1)],
                                      axis=0)
# Reset Index
df_sent_topics_sorted.reset_index(drop=True, inplace=True)
# Format
df_sent_topics_sorted.columns = ['Topic_Num', "Topic_Perc_Contrib", "Keywords", "Text", "Country", "Name"]
df_sent_topics_sorted = df_sent_topics_sorted.drop(['Text'], axis=1)
# Show
df_sent_topics_sorted.head()
This pilot aims at visualising the prevalence of each topic within each country's mix of initiatives, compared to other OECD countries and to the OECD median. The last step in applying the model is to evaluate each initiative against the topics, i.e. to obtain how each initiative scores against each topic. A topic score is a value in the range [0,1] that indicates the probability that the initiative belongs to a given topic. It can be interpreted as a measure of how much the initiative's text emphasises that topic. Such scores can be aggregated to obtain a topic score by country and survey edition. To do so, we apply the following methodology:
(1) We calculate the absolute topic score for country X in year Y by adding initiatives’ individual scores. We identify the OECD median score for each topic.
(2) We normalise the values in (1) by the number of words reported by country X in year Y. We do the same for the OECD median topic score. This normalisation accounts for country differences in reporting (Section 4).
(3) The normalised topic scores can then be divided by the OECD median topic score. This allows comparing how prevalent topics are in each country relative to the OECD median.
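The three normalisation steps above can be sketched with a toy example. The values are made up for illustration; the per-10 000-words scaling mirrors the approach used later in this notebook, and the median here is taken over three fictitious countries rather than the OECD membership:

```python
import pandas as pd

# Step 1 (toy data): absolute topic scores, i.e. the sum of initiatives'
# individual scores for one topic, for one survey year
df = pd.DataFrame({
    'CountryLabel': ['A', 'B', 'C'],
    'Topic1':       [4.0, 2.0, 1.0],
    'Count_word':   [20000, 10000, 10000],  # words reported by each country
})

# Step 2: normalise by the country's word count (per 10 000 words)
df['Topic1_norm'] = df['Topic1'] / (df['Count_word'] / 10000)

# Step 3: divide by the median normalised score, so 1.0 = median prevalence
median = df['Topic1_norm'].median()
df['Topic1_vs_median'] = df['Topic1_norm'] / median
print(df[['CountryLabel', 'Topic1_vs_median']])
```

With these numbers, countries A and B land exactly at the median (1.0) while C emphasises the topic half as much (0.5), which is how the ratios in the figures below should be read.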
This methodology puts the emphasis on a topic's prevalence relative to other topics and to the OECD median.
In theory, the above methodology could also be weighted by initiatives' budgets. However, when we attempted this approach, we identified important inconsistencies in how countries report budget data. Such inconsistencies propagate into the results, making them unreliable.
As topic prevalence is not weighted by budget, this metric should not be interpreted as a measure of policy effort. Rather, topic prevalence indicates how more or less frequently countries raise a given topic in their policies (relative to others). It does not intend to evaluate performance, but rather to help characterise national policy mixes.
# 1) Generate a dataframe detailing topic contributions to each initiative
# generate a dataframe to receive the 'de-list-of-listed' data
central_rep = pd.DataFrame(columns = [0,1,2,3,4,5,6,7])
#import time
#start_time = time.time()
compass_df_reduced1 = compass_df_reduced.copy()
for x in range(0, compass_df_reduced1.shape[0]):
    tests = lda_model[corpus][x][0] #ldamallet[corpus[x]]
    s = pd.DataFrame(tests).T          # row 0: topic ids, row 1: topic scores
    s.columns = s.iloc[0].astype(int)  # use topic ids as column names
    s = s[1:]                          # keep only the scores row
    # add the result to the central dataframe (DataFrame.append is deprecated)
    central_rep = pd.concat([central_rep, s], ignore_index=False)
#merge the weighting with original dataframes:
central_rep.reset_index(drop=True, inplace=True)
central_rep.reset_index(inplace=True)
compass_df_reduced1.reset_index(drop=True, inplace=True)
compass_df_reduced1.reset_index(inplace=True)
# merge topics scores with list of initiatives
compass_df_reduced2 = pd.merge(compass_df_reduced1,
central_rep,
on ='index',
how ='inner')
compass_df_reduced2 = compass_df_reduced2.fillna(0).copy()
compass_df_reduced2 = compass_df_reduced2.drop(['Objectives', 'index'], axis=1)
#now extend to compass_df (which contains initiatives over multiple years):
#compass_df_reduced2 = compass_df_reduced2.copy()
compass_df_reduced2 = compass_df_reduced2.rename(columns = {0: "Topic1",1: "Topic2",2: "Topic3",3: "Topic4",
4: "Topic5",5: "Topic6",6: "Topic7",7: "Topic8",
8: "Topic9",9: "Topic10",10: "Topic11",11: "Topic12",
12: "Topic13",13: "Topic14",14: "Topic15",15: "Topic16",
16: "Topic17",17: "Topic18",18: "Topic19",19: "Topic20",
20: "Topic21",21:"Topic22",22:"Topic23",23:"Topic24"})
compass_df_statics = compass_df[['OECD_STATUS', 'SurveyYear', 'IIDYear', 'CountryLabel', 'NameEnglish', 'Objectives', 'Objectives1', 'Objectives2', 'Objectives3', 'Objectives4', 'Objectives5', 'Objectives6', 'ShortDescription']].copy()
compass_df2 = pd.merge(compass_df_reduced2,
compass_df_statics,
on ='IIDYear',
how ='inner')
# print file
#compass_df2.to_excel(r'C:\Users\Gildas\OneDrive\OECD\test gensim\Breakdown_intitiatives_obj-YYYY.xlsx', index=False, sheet_name="Z")
# We obtained a dataframe detailing each topic's percentage contribution to each initiative
# We'll now normalise the data to facilitate comparison between countries
# 2) Normalisation procedure: normalise topic contributions by the country's (pre-processed) word count and by each topic's OECD median
# 2.1) word count normalisation
# Define a function to count words in a string
def count_word(string):
    return len(string.strip().split(" "))
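As a quick illustration of the counter just defined (and of a quirk worth knowing): splitting on a single space counts the empty tokens produced by repeated spaces, so irregularly spaced text is slightly over-counted, whereas an argument-less `split()` collapses runs of whitespace:

```python
def count_word(string):
    # same naive whitespace counter as defined above
    return len(string.strip().split(" "))

print(count_word("support business innovation"))  # 3
print(count_word("double  spaced  text"))         # 5: repeated spaces yield empty tokens
print(len("double  spaced  text".split()))        # 3: split() collapses whitespace
```

For the purpose of normalising topic scores this over-count is a second-order effect, since it affects all countries' texts similarly.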
# Apply count_word function to dataframe
compass_df5 = compass_df.reset_index().copy()
compass_df5['Count_word'] = compass_df5['Objectives'].apply(lambda z:count_word(z)) #compass_df2['ShortDescription'].apply(lambda z:count_word(z)) + compass_df2['Objectives1'].apply(lambda z:count_word(z)) + compass_df2['Objectives2'].apply(lambda z:count_word(z)) + compass_df2['Objectives3'].apply(lambda z:count_word(z)) + compass_df2['Objectives4'].apply(lambda z:count_word(z)) + compass_df2['Objectives5'].apply(lambda z:count_word(z)) + compass_df2['Objectives6'].apply(lambda z:count_word(z))
# Generate a table of total word counts by country and SurveyYear.
#compass_df5 = compass_df5[(compass_df5.OECD_STATUS == "OECD country")]
#compass_df_stats_OECD = compass_df_stats[(compass_df_stats.OECD_STATUS == "OECD country")]
compass_df5 = pd.pivot_table(data=compass_df5, index=['SurveyYear','OECD_STATUS','CountryLabel'], values=['Count_word','InitiativeID'], aggfunc={'Count_word': pd.Series.sum,'InitiativeID': pd.Series.count}, fill_value=0).reset_index()
compass_df5_non = compass_df5.drop(['OECD_STATUS'], axis=1)
# merge word counts into the main dataframe of initiatives
compass_df3 = pd.merge(compass_df2,
compass_df5_non,
on =['SurveyYear','CountryLabel'],
how ='inner')
# Normalise all topic contributions by Count_word (per 10 000 words)
for x in range(1, compass_df3.shape[1]-14):
    compass_df3.iloc[:,x] = compass_df3.iloc[:,x] / (compass_df3["Count_word"]/10000)
# print file
#compass_df3.to_excel(r'C:\Users\Gildas\OneDrive\OECD\test gensim\test_Count_word.xlsx', index=False, sheet_name="Z")
# Technical: add empty columns for topics 18-24 (absent from this model's output)
for t in range(18, 25):
    compass_df3["Topic" + str(t)] = 0
# sum topics score by country and filter out data not needed
compass_df_table = pd.pivot_table(data=compass_df3, index=['SurveyYear', 'OECD_STATUS', 'CountryLabel'], values=['Topic1','Topic2','Topic3','Topic4','Topic5','Topic6','Topic7','Topic8','Topic9','Topic10','Topic11','Topic12','Topic13','Topic14','Topic15','Topic16','Topic17','Topic18','Topic19','Topic20','Topic21','Topic22','Topic23','Topic24'], aggfunc={'Topic1': pd.Series.sum, 'Topic2': pd.Series.sum, 'Topic3': pd.Series.sum, 'Topic4': pd.Series.sum, 'Topic5': pd.Series.sum, 'Topic6': pd.Series.sum, 'Topic7': pd.Series.sum, 'Topic8': pd.Series.sum, 'Topic9': pd.Series.sum, 'Topic10': pd.Series.sum, 'Topic11': pd.Series.sum, 'Topic12': pd.Series.sum, 'Topic13': pd.Series.sum, 'Topic14': pd.Series.sum, 'Topic15': pd.Series.sum, 'Topic16': pd.Series.sum, 'Topic17': pd.Series.sum, 'Topic18': pd.Series.sum, 'Topic19': pd.Series.sum, 'Topic20': pd.Series.sum, 'Topic21': pd.Series.sum, 'Topic22': pd.Series.sum, 'Topic23': pd.Series.sum, 'Topic24': pd.Series.sum}, fill_value=0).reset_index()
compass_df_table = compass_df_table[['SurveyYear', 'OECD_STATUS', 'CountryLabel', 'Topic1','Topic2','Topic3','Topic4','Topic5','Topic6','Topic7','Topic8','Topic9','Topic10','Topic11','Topic12','Topic13','Topic14','Topic15','Topic16','Topic17','Topic18','Topic19','Topic20','Topic21','Topic22','Topic23','Topic24']].copy()
# print file
compass_df_table.to_excel(r'C:\Users\Gildas\OneDrive\OECD\test gensim\test_Count_word0.xlsx', index=False, sheet_name="Z")
# 2.2) OECD median normalisation
compass_df_table_OECD = compass_df_table[compass_df_table["OECD_STATUS"] == "OECD country"]
compass_df_table_OECD = pd.pivot_table(data=compass_df_table_OECD, index=['SurveyYear', 'OECD_STATUS'], values=['Topic1','Topic2','Topic3','Topic4','Topic5','Topic6','Topic7','Topic8','Topic9','Topic10','Topic11','Topic12','Topic13','Topic14','Topic15','Topic16','Topic17','Topic18','Topic19','Topic20','Topic21','Topic22','Topic23','Topic24'], aggfunc={'Topic1': pd.Series.median, 'Topic2': pd.Series.median, 'Topic3': pd.Series.median, 'Topic4': pd.Series.median, 'Topic5': pd.Series.median, 'Topic6': pd.Series.median, 'Topic7': pd.Series.median, 'Topic8': pd.Series.median, 'Topic9': pd.Series.median, 'Topic10': pd.Series.median, 'Topic11': pd.Series.median, 'Topic12': pd.Series.median, 'Topic13': pd.Series.median, 'Topic14': pd.Series.median, 'Topic15': pd.Series.median, 'Topic16': pd.Series.median, 'Topic17': pd.Series.median, 'Topic18': pd.Series.median, 'Topic19': pd.Series.median, 'Topic20': pd.Series.median, 'Topic21': pd.Series.median, 'Topic22': pd.Series.median, 'Topic23': pd.Series.median, 'Topic24': pd.Series.median}, fill_value=0).reset_index()
compass_df_table_OECD.rename(columns={'OECD_STATUS': 'CountryLabel'}, inplace=True)
compass_df_table_OECD_2017 = compass_df_table_OECD[compass_df_table_OECD["SurveyYear"] == 2017]
compass_df_table_OECD_2019 = compass_df_table_OECD[compass_df_table_OECD["SurveyYear"] == 2019]
compass_df_table_OECD_2021 = compass_df_table_OECD[compass_df_table_OECD["SurveyYear"] == 2021]
compass_df_table_2017 = compass_df_table[compass_df_table["SurveyYear"] == 2017]
compass_df_table_2017 = compass_df_table_2017.drop(['OECD_STATUS'], axis=1)
compass_df_table_2019 = compass_df_table[compass_df_table["SurveyYear"] == 2019]
compass_df_table_2019 = compass_df_table_2019.drop(['OECD_STATUS'], axis=1)
compass_df_table_2021 = compass_df_table[compass_df_table["SurveyYear"] == 2021]
compass_df_table_2021 = compass_df_table_2021.drop(['OECD_STATUS'], axis=1)
compass_df_table_OECD_2017 = compass_df_table_OECD_2017[compass_df_table_OECD_2017["SurveyYear"] == 2017]
compass_df_table_OECD_2017 = compass_df_table_OECD_2017.loc[compass_df_table_OECD_2017.index.repeat(52)]
compass_df_table_OECD_2017.reset_index(inplace = True)
compass_df_table_OECD_2017 = compass_df_table_OECD_2017.drop(['index'], axis=1)
compass_df_table_OECD_2017.index = np.arange(52, len(compass_df_table_OECD_2017) + 52)
columns_list = compass_df_table_2017.columns
compass_df_table_OECD_2017=compass_df_table_OECD_2017.reindex(columns= columns_list)
compass_df_table_OECD_2017.reset_index(inplace = True)
compass_df_table_OECD_2017 = compass_df_table_OECD_2017.drop(['index'], axis=1)
compass_df_table_2017.iloc[:,2:] = compass_df_table_2017.iloc[:,2:].divide(compass_df_table_OECD_2017.iloc[:,2:], axis=0)
compass_df_table_OECD_2019 = compass_df_table_OECD_2019[compass_df_table_OECD_2019["SurveyYear"] == 2019]
compass_df_table_OECD_2019 = compass_df_table_OECD_2019.loc[compass_df_table_OECD_2019.index.repeat(53)]
compass_df_table_OECD_2019.reset_index(inplace = True)
compass_df_table_OECD_2019 = compass_df_table_OECD_2019.drop(['index'], axis=1)
compass_df_table_OECD_2019.index = np.arange(52, len(compass_df_table_OECD_2019) + 52)
columns_list = compass_df_table_2019.columns
compass_df_table_OECD_2019=compass_df_table_OECD_2019.reindex(columns= columns_list)
compass_df_table_2019.iloc[:,2:] = compass_df_table_2019.iloc[:,2:].divide(compass_df_table_OECD_2019.iloc[:,2:], axis=0)
compass_df_table_OECD_2021 = compass_df_table_OECD_2021[compass_df_table_OECD_2021["SurveyYear"] == 2021]
compass_df_table_OECD_2021 = compass_df_table_OECD_2021.loc[compass_df_table_OECD_2021.index.repeat(59)]
compass_df_table_OECD_2021.reset_index(inplace = True)
compass_df_table_OECD_2021 = compass_df_table_OECD_2021.drop(['index'], axis=1)
compass_df_table_OECD_2021.index = np.arange(105, len(compass_df_table_OECD_2021) + 105)
columns_list = compass_df_table_2021.columns
compass_df_table_OECD_2021 =compass_df_table_OECD_2021.reindex(columns= columns_list)
compass_df_table_2021.iloc[:,2:] = compass_df_table_2021.iloc[:,2:].divide(compass_df_table_OECD_2021.iloc[:,2:], axis=0)
# append the data for each SurveyYear into a single dataframe (DataFrame.append is deprecated)
compass_df_table = pd.concat([compass_df_table_2017, compass_df_table_2019, compass_df_table_2021])
# print file
#compass_df_table.to_excel(r'C:\Users\Gildas\OneDrive\OECD\test gensim\test_Count_word1.xlsx', index=False, sheet_name="Z")
This section presents four sets of charts to visualise the prevalence of topics and their evolution across the survey editions. Large shifts over time signal relatively increased or diminished attention to a particular topic in a country's mix of initiatives, compared to the OECD median mix. As mentioned earlier, the OECD median mix itself evolves with each edition of the survey, as apparent in Figures 4 and 5.
The following two charts show the OECD median prevalence of each topic, ordered by decreasing prevalence.
The bar chart in Figure 4 details the OECD median prevalence of all topics calculated by the model. Across the three surveys, 7 to 8 topics explain 80% of all initiatives' corpus. Figure 5 focuses only on the topics retained from the model and shows that over 50% of the initiatives' text corpus relates to these topics. The topics are stable across all three surveys but, as we will see, some countries display large changes in how they emphasise topics across the survey editions.
# reorder dataframe by importance of topics
compass_df_table_OECD = compass_df_table_OECD.drop(['CountryLabel'], axis=1)
compass_df_table_OECD = compass_df_table_OECD.iloc[:,np.argsort(-compass_df_table_OECD.values[2])]
columns_list = compass_df_table_OECD.iloc[:,1:].columns
# aggregate the topics score by year:
compass_df_table_graph4 = compass_df_table_OECD
compass_df_table_graph4['total'] = compass_df_table_graph4[columns_list].sum(axis=1)
#compass_df_table_graph4=compass_df_table_graph4.reindex(columns= columns_list) #['SurveyYear','Topic1','Topic2','Topic3','Topic4','Topic5','Topic6','Topic7','Topic8','Topic9','Topic10','Topic11','Topic12','Topic13','Topic14','Topic15','Topic16','Topic17','Topic18','Topic19','Topic20','Topic21','Topic22','Topic23','Topic24', 'total'])
# compute proportions
for x in range(1, compass_df_table_graph4.shape[1]-1):
    compass_df_table_graph4.iloc[:,x] = compass_df_table_graph4.iloc[:,x] / compass_df_table_graph4['total']
# generate the graph
import plotly.graph_objects as go  # aliased as go, since go.Bar is used below
import numpy as np
country =[2017,2019,2021] #country = compass_df_table_graph4['SurveyYear']
colors = {'Null': '#ffffff','A':'#594d45','B':'#79533d','C':'#595142','D':'#745d46','E':'#ac7e54','F':'#a9947b','G':'#b49d80','H':'#bcab90','I':'#535640','J':'#54504b','K':'#63613e','L':'#4a5444','M':'#5c5c5b','N':'#9495a5','O':'#373538'}
plot = go.Figure(data=[go.Bar(
name = 'Topic13',
x = compass_df_table_graph4['Topic13'],
y = ["2017","2019","2021"],
marker_color=colors['A'],
orientation = 'h',
),
go.Bar(
name = 'Topic14',
x = compass_df_table_graph4['Topic14'],
y = ["2017","2019","2021"],
marker_color=colors['B'],
orientation = 'h',
),
go.Bar(
name = 'Topic3',
x = compass_df_table_graph4['Topic3'],
y = ["2017","2019","2021"],
marker_color=colors['C'],
orientation = 'h',
),
go.Bar(
name = 'Topic4',
x = compass_df_table_graph4['Topic4'],
y = ["2017","2019","2021"],
marker_color=colors['D'],
orientation = 'h',
),
go.Bar(
name = 'Topic5',
x = compass_df_table_graph4['Topic5'],
y = ["2017","2019","2021"],
marker_color=colors['E'],
orientation = 'h',
),
go.Bar(
name = 'Topic2',
x = compass_df_table_graph4['Topic2'],
y = ["2017","2019","2021"],
marker_color=colors['G'],
orientation = 'h',
),
go.Bar(
name = 'Topic12',
x = compass_df_table_graph4['Topic12'],
y = ["2017","2019","2021"],
marker_color=colors['H'],
orientation = 'h',
),
go.Bar(
name = 'Topic15',
x = compass_df_table_graph4['Topic15'],
y = ["2017","2019","2021"],
marker_color=colors['J'],
orientation = 'h',
),
go.Bar(
name = 'Topic7',
x = compass_df_table_graph4['Topic7'],
y = ["2017","2019","2021"],
marker_color=colors['K'],
orientation = 'h',
),
go.Bar(
name = 'Topic17',
x = compass_df_table_graph4['Topic17'],
y = ["2017","2019","2021"],
marker_color=colors['L'],
orientation = 'h',
),
go.Bar(
name = 'Topic8',
x = compass_df_table_graph4['Topic8'],
y = ["2017","2019","2021"],
marker_color=colors['M'],
orientation = 'h',
),
go.Bar(
name = 'Topic6',
x = compass_df_table_graph4['Topic6'],
y = ["2017","2019","2021"],
marker_color=colors['N'],
orientation = 'h',
),
go.Bar(
name = 'Topic1',
x = compass_df_table_graph4['Topic1'],
y = ["2017","2019","2021"],
marker_color=colors['O'],
orientation = 'h',
),
go.Bar(
name = 'Topic10',
x = compass_df_table_graph4['Topic10'],
y = ["2017","2019","2021"],
marker_color=colors['A'],
orientation = 'h',
),
go.Bar(
name = 'Topic11',
x = compass_df_table_graph4['Topic11'],
y = ["2017","2019","2021"],
marker_color=colors['B'],
orientation = 'h',
),
go.Bar(
name = 'Topic9',
x = compass_df_table_graph4['Topic9'],
y = ["2017","2019","2021"],
marker_color=colors['F'],
orientation = 'h',
),
go.Bar(
name = 'Topic16',
x = compass_df_table_graph4['Topic16'],
y = ["2017","2019","2021"],
marker_color=colors['C'],
orientation = 'h',
)
])
plot.update_layout(
title={
'text': "Figure 4: Median share of OECD country topics prevalence, all topics",
'y':0.9,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'},
title_font_color="#000000",
title_font_family="verdana",
xaxis=dict(
showgrid=False,
showline=True,
linecolor="#000000",
tickfont_color="#000000",
showticklabels=True,
tickformat=".0%",
dtick=0.1,
ticks='outside',
tickcolor="#000000",
),
margin=dict(l=140, r=40, b=50, t=100),
legend=dict(
font_size=12,
y=0.5,
#yanchor='middle',
xanchor='left',
font=dict(color="black"),
),
width=1000,
height=500,
paper_bgcolor='white',
plot_bgcolor='white',
hovermode='closest',
)
plot.update_layout(barmode='stack')
#print the graph
#fig.write_html(r"C:\Users\Gildas\OneDrive\OECD\Presentation_graphs\fig4.html")
plot.write_image(r"C:\Users\Gildas\OneDrive\OECD\Presentation_graphs\fig4.png")
plot.show()
# generate the graph
import plotly.graph_objects as go  # aliased as go, since go.Bar is used below
import numpy as np
country =[2017,2019,2021]
colors = {'Null': '#ffffff','A':'#594d45','B':'#79533d','C':'#595142','D':'#745d46','E':'#ac7e54','F':'#a9947b','G':'#b49d80','H':'#bcab90','I':'#535640','J':'#54504b','K':'#63613e','L':'#4a5444','M':'#5c5c5b','N':'#9495a5','O':'#373538'}
plot = go.Figure(data=[go.Bar(
name = 'Topic3',
x = compass_df_table_graph4['Topic3'],
y = ["2017","2019","2021"],
marker_color=colors['C'],
orientation = 'h',
),
go.Bar(
name = 'Topic4',
x = compass_df_table_graph4['Topic4'],
y = ["2017","2019","2021"],
marker_color=colors['D'],
orientation = 'h',
),
go.Bar(
name = 'Topic5',
x = compass_df_table_graph4['Topic5'],
y = ["2017","2019","2021"],
marker_color=colors['E'],
orientation = 'h',
),
go.Bar(
name = 'Topic12',
x = compass_df_table_graph4['Topic12'],
y = ["2017","2019","2021"],
marker_color=colors['G'],
orientation = 'h',
),
go.Bar(
name = 'Topic7',
x = compass_df_table_graph4['Topic7'],
y = ["2017","2019","2021"],
marker_color=colors['I'],
orientation = 'h',
),
go.Bar(
name = 'Topic17',
x = compass_df_table_graph4['Topic17'],
y = ["2017","2019","2021"],
marker_color=colors['K'],
orientation = 'h',
),
go.Bar(
name = 'Topic6',
x = compass_df_table_graph4['Topic6'],
y = ["2017","2019","2021"],
marker_color=colors['L'],
orientation = 'h',
)
])
plot.update_layout(
title={
'text': "Figure 5: Median share of OECD country topics prevalence, retained topics",
'y':0.9,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'},
title_font_color="#000000",
title_font_family="verdana",
xaxis=dict(
showgrid=False,
showline=True,
linecolor="#000000",
tickfont_color="#000000",
showticklabels=True,
tickformat=".0%",
dtick=0.1,
ticks='outside',
tickcolor="#000000",
),
margin=dict(l=140, r=40, b=50, t=100),
legend=dict(
font_size=12,
y=0.5,
#yanchor='middle',
xanchor='left',
font=dict(color="black"),
),
width=1000,
height=500,
paper_bgcolor='white',
plot_bgcolor='white',
hovermode='closest',
)
plot.update_layout(barmode='stack')
#print the graph
#fig.write_html(r"C:\Users\Gildas\OneDrive\OECD\Presentation_graphs\fig41.html")
plot.write_image(r"C:\Users\Gildas\OneDrive\OECD\Presentation_graphs\fig41.png")
plot.show()
The second set of visualisations focuses on each topic identified by the model from the pool of business innovation initiatives. Each chart compares how the eight selected countries emphasise a topic in the 2017 and 2021 datasets. In the following figures, the label "Topic less prevalent" marks topics that are 50% less frequent than the OECD median. Similarly, the label "Topic more prevalent" marks topics that are 50% more frequent compared to the OECD median.
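The 50% thresholds can be expressed as a small helper function. This is a hypothetical sketch for illustration; in the figures themselves the labels are fixed axis ticks at 0.5 and 1.5, not values computed per country:

```python
def prevalence_label(ratio_to_median):
    """Classify a country's topic prevalence relative to the OECD median (= 1.0)."""
    if ratio_to_median <= 0.5:
        return "Topic less prevalent"
    if ratio_to_median >= 1.5:
        return "Topic more prevalent"
    return "Around OECD median"

print(prevalence_label(0.4))  # Topic less prevalent
print(prevalence_label(1.0))  # Around OECD median
print(prevalence_label(1.8))  # Topic more prevalent
```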
# Reorder dataframe by importance of topics and trim out unwanted topics
columns_list = ['Topic3','Topic4','Topic5','Topic12','Topic7','Topic17','Topic6']
columns_list = ['SurveyYear'] + ['CountryLabel'] + columns_list
# Reorder dataframe by importance of topics and trim out unwanted topics
new_row = pd.DataFrame([['2017', 'OECD', 1,1,1,1,1,1,1], ['2019', 'OECD', 1,1,1,1,1,1,1],['2021', 'OECD', 1,1,1,1,1,1,1]], columns=['SurveyYear','CountryLabel','Topic3','Topic4','Topic5','Topic6','Topic7','Topic12','Topic17'])
compass_df_table=compass_df_table.reindex(columns= columns_list)
compass_df_table = pd.concat([compass_df_table, new_row], ignore_index=True)
compass_df_table["SurveyYear"] = compass_df_table["SurveyYear"].astype(int)
compass_df_table=compass_df_table.reindex(columns= columns_list)
# generate list of topics
topic_title={'SurveyYear':['Topic3','Topic4','Topic5','Topic6','Topic7','Topic12','Topic17'],'title':['Leveraging the RDI ecosystem', 'Access to finance', 'Entrepreneurship & business support services', 'Tackling societal challenges', 'Tax incentives', 'Stimulating demand for innovation','Industrial technology development']}
topic_title = pd.DataFrame(topic_title)
This is the most prevalent of the selected topics in the corpus. The distribution of countries is rather compact, ranging from 50% to 150% of the OECD median, but countries show different dynamics. Compared to other countries, German and Korean business innovation policies most frequently emphasise linkages with the RDI ecosystem, followed by Brazil. In Germany, emphasis in 2021 increased markedly relative to 2017. In the other countries, topic prevalence is lower than the OECD median, with Italy emphasising the topic the least, and Canada and France emphasising it less in 2021 than in 2017.
### create a graph:
topic = 'Topic3'
topic_title1 = topic_title.loc[topic_title['SurveyYear'] == topic]
topic_title1 = topic_title1.reset_index(drop=True, inplace=False)
topic_title1 = topic_title1['title'][0]
topic_title1
compass_df_table_graph2 = compass_df_table[['CountryLabel', 'SurveyYear',topic]].copy()
compass_df_table_graph2 = compass_df_table_graph2.loc[(compass_df_table_graph2['CountryLabel'] == 'Australia') | (compass_df_table_graph2['CountryLabel'] == 'United Kingdom') | (compass_df_table_graph2['CountryLabel'] == 'Brazil') | (compass_df_table_graph2['CountryLabel'] == 'Germany') | (compass_df_table_graph2['CountryLabel'] == 'France') | (compass_df_table_graph2['CountryLabel'] == 'Canada') | (compass_df_table_graph2['CountryLabel'] == 'Italy') | (compass_df_table_graph2['CountryLabel'] == 'Korea') | (compass_df_table_graph2['CountryLabel'] == 'OECD')]
compass_df_table_graph2["SurveyYear"] = compass_df_table_graph2["SurveyYear"].astype(int)
compass_df_table_graph2 = pd.pivot_table(data=compass_df_table_graph2, index=['CountryLabel'], columns=['SurveyYear'], values=topic, aggfunc={topic: pd.Series.mean}, fill_value=0).reset_index()
compass_df_table_graph2 = compass_df_table_graph2.rename(columns = {"" : "YearlyBudgetRange"})
compass_df_table_graph2 = compass_df_table_graph2.T.reset_index()
new_header = compass_df_table_graph2.iloc[0]
compass_df_table_graph2 = compass_df_table_graph2[1:]
compass_df_table_graph2.columns = new_header
for x in range(1, compass_df_table_graph2.shape[1]):
    compass_df_table_graph2.iloc[:,x] = compass_df_table_graph2.iloc[:,x] / compass_df_table_graph2["OECD"]
compass_df_table_graph2 = compass_df_table_graph2.T.reset_index()
new_header = compass_df_table_graph2.iloc[0]
compass_df_table_graph2 = compass_df_table_graph2[1:]
compass_df_table_graph2.columns = new_header
compass_df_table_graph2 = compass_df_table_graph2.rename(columns = {"YearlyBudgetRange" : "CountryLabel"})
compass_df_table_graph2["OECD"] = 1.0
compass_df_table_graph2 = compass_df_table_graph2[compass_df_table_graph2.CountryLabel != "OECD"]
# order data by 2021 score
compass_df_table_graph2.sort_values(by=[2021], inplace=True, ascending=True)
# Generate the graph:
import plotly.graph_objects as go
country = compass_df_table_graph2['CountryLabel']
Ratio17 = compass_df_table_graph2[2017]
Ratio19 = compass_df_table_graph2[2019]
Ratio21 = compass_df_table_graph2[2021]
OECD = compass_df_table_graph2['OECD']
colors = {'Null': '#ffffff','A':'#594d45','B':'#79533d','C':'#595142','D':'#745d46','E':'#ac7e54','F':'#a9947b','G':'#b49d80','H':'#bcab90','I':'#535640','J':'#54504b','K':'#63613e','L':'#4a5444','M':'#5c5c5b','N':'#9495a5','O':'#373538','Z':'#000000'}
fig = go.Figure()
fig.add_trace(go.Scatter(
x=Ratio17,
y=country,
name='2017 Survey',
marker_color=colors['G'],
marker=dict(
size=13,
cmax=1,
cmin=0,
line_color="#000000",
)
))
#fig.add_trace(go.Scatter(
# x=Ratio19,
# y=country,
# name='2019 Survey',
# marker_color=colors['D'],
# marker=dict(
# size=13,
# cmax=1,
# cmin=0,
# line_color="#000000",
# )
#))
fig.add_trace(go.Scatter(
x=Ratio21,
y=country,
name='2021 Survey',
marker_color=colors['A'],
marker=dict(
size=13,
cmax=1,
cmin=0,
line_color="#000000",
)
))
fig.update_traces(mode='markers', marker=dict(line_width=1, symbol='circle', size=13))
fig.update_layout(
title={
'text': "Figure 6: Prevalence of topic in countries’ declared objectives vs OECD Median",
'y':0.9,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'},
title_font_color="#000000",
title_font_family="verdana",
xaxis=dict(
showgrid=False,
showline=True,
linecolor="#000000",
tickfont_color="#000000",
showticklabels=True,
dtick=0.5,
tickformat=".0%",
ticks='outside',
tickcolor="#000000",
),
margin=dict(l=140, r=40, b=50, t=100),
legend=dict(
font_size=10,
y=0.5,
#yanchor='middle',
xanchor='left',
font=dict(color="black"),
),
width=1000,
height=400,
paper_bgcolor='white',
plot_bgcolor='white',
hovermode='closest',
)
fig.update_yaxes(
tickfont=dict(family="verdana", color="#000000", size=13)
)
fig.update_xaxes(
title_text=topic_title1,
title_font_family="verdana",
title_font_size=17,
title_font_color="#000000",
fixedrange=True,
range=(-0.01, 2),
ticktext=["","Topic less<br>prevalent", "OECD<br>Median", "Topic more<br>prevalent"],
tickvals=[0, 0.5, 1, 1.5],
showgrid=True,
gridcolor="#444",
# gridfont=dict(family="Open Sans"),
ticks="outside",
tickson="boundaries",
tickcolor="#444",
tickfont=dict(family="verdana", color="#000000", size=13)
)
#print the graph
#fig.write_html(r"C:\Users\Gildas\OneDrive\OECD\Presentation_graphs\fig5.html")
fig.write_image(r"C:\Users\Gildas\OneDrive\OECD\Presentation_graphs\fig5.png")
fig.show()
Among the selected countries, the topic is most prevalent in the policies reported by Italy. This topic has been emphasised with increased frequency in the policies reported by Korea and the United Kingdom. By contrast, policies in France and (to a lesser extent) in Brazil shifted away from this topic in the 2021 edition of the survey. Topic prevalence for Canada and Germany remained stable (and low).
### create a graph:
topic = 'Topic4'
topic_title1 = topic_title.loc[topic_title['SurveyYear'] == topic]
topic_title1 = topic_title1.reset_index(drop=True, inplace=False)
topic_title1 = topic_title1['title'][0]
compass_df_table_graph2 = compass_df_table[['CountryLabel', 'SurveyYear',topic]].copy()
compass_df_table_graph2 = compass_df_table_graph2.loc[(compass_df_table_graph2['CountryLabel'] == 'Australia') | (compass_df_table_graph2['CountryLabel'] == 'United Kingdom') | (compass_df_table_graph2['CountryLabel'] == 'Brazil') | (compass_df_table_graph2['CountryLabel'] == 'Germany') | (compass_df_table_graph2['CountryLabel'] == 'France') | (compass_df_table_graph2['CountryLabel'] == 'Canada') | (compass_df_table_graph2['CountryLabel'] == 'Italy') | (compass_df_table_graph2['CountryLabel'] == 'Korea') | (compass_df_table_graph2['CountryLabel'] == 'OECD')]
compass_df_table_graph2["SurveyYear"] = compass_df_table_graph2["SurveyYear"].astype(int)
compass_df_table_graph2 = pd.pivot_table(data=compass_df_table_graph2, index=['CountryLabel'], columns=['SurveyYear'], values=[topic], aggfunc={topic: pd.Series.mean}, fill_value=0).reset_index()
compass_df_table_graph2 = compass_df_table_graph2.rename(columns = {"" : "YearlyBudgetRange"})
compass_df_table_graph2.columns = compass_df_table_graph2.columns.droplevel(-2)
compass_df_table_graph2 = compass_df_table_graph2.T.reset_index()
new_header = compass_df_table_graph2.iloc[0]
compass_df_table_graph2 = compass_df_table_graph2[1:]
compass_df_table_graph2.columns = new_header
for x in range(1, compass_df_table_graph2.shape[1]):
compass_df_table_graph2.iloc[:,x] = compass_df_table_graph2.iloc[:,x] / compass_df_table_graph2["OECD"]
compass_df_table_graph2 = compass_df_table_graph2.T.reset_index()
new_header = compass_df_table_graph2.iloc[0]
compass_df_table_graph2 = compass_df_table_graph2[1:]
compass_df_table_graph2.columns = new_header
compass_df_table_graph2 = compass_df_table_graph2.rename(columns = {"YearlyBudgetRange" : "CountryLabel"})
compass_df_table_graph2["OECD"] = 1.0
compass_df_table_graph2 = compass_df_table_graph2[compass_df_table_graph2.CountryLabel != "OECD"]
# order data by 2021 score
compass_df_table_graph2.sort_values(by=[2021], inplace=True, ascending=True)
# Save the country order (sorted by 2021 score) for reuse in later figures
Order = compass_df_table_graph2[["CountryLabel"]].copy()
Order = Order.rename(columns = {"CountryLabel" : topic})
Order = Order.reset_index(drop=True)
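The transpose / divide-by-OECD-column / transpose-back sequence above normalises each country's yearly topic score by the OECD value. On a small hypothetical frame with countries as rows and survey years as columns, pandas' `DataFrame.div` achieves the same result in one step:

```python
import pandas as pd

# Hypothetical frame: mean topic score per country (rows) by survey year (columns).
scores = pd.DataFrame(
    {2017: [0.10, 0.30, 0.20], 2021: [0.15, 0.25, 0.20]},
    index=["France", "Italy", "OECD"],
)

# Divide every row by the OECD row: a value of 1.0 means "at the OECD median".
normalised = scores.div(scores.loc["OECD"], axis=1)

print(round(normalised.loc["Italy", 2017], 6))  # 0.30 / 0.20 = 1.5
```

Either form yields the same ratios; the transpose-based version in this notebook was kept because it preserves the intermediate header-juggling used elsewhere.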
# Generate the graph:
import plotly.graph_objects as go
country = compass_df_table_graph2['CountryLabel']
Ratio17 = compass_df_table_graph2[2017]
Ratio19 = compass_df_table_graph2[2019]
Ratio21 = compass_df_table_graph2[2021]
OECD = compass_df_table_graph2['OECD']
colors = {'Null': '#ffffff','A':'#594d45','B':'#79533d','C':'#595142','D':'#745d46','E':'#ac7e54','F':'#a9947b','G':'#b49d80','H':'#bcab90','I':'#535640','J':'#54504b','K':'#63613e','L':'#4a5444','M':'#5c5c5b','N':'#9495a5','O':'#373538','Z':'#000000'}
fig = go.Figure()
fig.add_trace(go.Scatter(
x=Ratio17,
y=country,
name='2017 Survey',
marker_color=colors['G'],
marker=dict(
size=13,
cmax=1,
cmin=0,
line_color="#000000",
)
))
#fig.add_trace(go.Scatter(
# x=Ratio19,
# y=country,
# name='2019 Survey',
# marker_color=colors['D'],
# marker=dict(
# size=13,
# cmax=1,
# cmin=0,
# line_color="#000000",
# )
#))
fig.add_trace(go.Scatter(
x=Ratio21,
y=country,
name='2021 Survey',
marker_color=colors['A'],
marker=dict(
size=13,
cmax=1,
cmin=0,
line_color="#000000",
)
))
fig.update_traces(mode='markers', marker=dict(line_width=1, symbol='circle', size=13))
fig.update_layout(
title={
'text': "Figure 7: Prevalence of topic in countries’ declared objectives vs OECD Median",
'y':0.9,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'},
title_font_color="#000000",
title_font_family="verdana",
xaxis=dict(
showgrid=False,
showline=True,
linecolor="#000000",
tickfont_color="#000000",
showticklabels=True,
dtick=0.5,
tickformat=".0%",
ticks='outside',
tickcolor="#000000",
),
margin=dict(l=140, r=40, b=50, t=100),
legend=dict(
font_size=10,
y=0.5,
#yanchor='middle',
xanchor='left',
font=dict(color="black"),
),
width=1000,
height=400,
paper_bgcolor='white',
plot_bgcolor='white',
hovermode='closest',
)
fig.update_yaxes(
tickfont=dict(family="verdana", color="#000000", size=13)
)
fig.update_xaxes(
title_text=topic_title1,
title_font_family="verdana",
title_font_size=17,
title_font_color="#000000",
fixedrange=True,
range=(-0.01, 2),
ticktext=["","Topic less<br>prevalent", "OECD<br>Median", "Topic more<br>prevalent"],
tickvals=[0, 0.5, 1, 1.5],
showgrid=True,
gridcolor="#444",
# gridfont=dict(family="Open Sans"),
ticks="outside",
tickson="boundaries",
tickcolor="#444",
tickfont=dict(family="verdana", color="#000000", size=13)
)
# save and display the graph
#fig.write_html(r"C:\Users\Gildas\OneDrive\OECD\Presentation_graphs\fig6.html")
fig.write_image(r"C:\Users\Gildas\OneDrive\OECD\Presentation_graphs\fig6.png")
fig.show()
This topic is most prevalent in British and Canadian policies, with Korean policies reaching the OECD median prevalence in 2021. It is least prevalent in the Italian and German data (for the latter, it was around the median in 2017). Policies in the other countries (France, Brazil and Australia) address entrepreneurship and business support services slightly below the OECD median.
### create a graph:
topic = 'Topic5'
topic_title1 = topic_title.loc[topic_title['SurveyYear'] == topic]
topic_title1 = topic_title1.reset_index(drop=True, inplace=False)
topic_title1 = topic_title1['title'][0]
compass_df_table_graph2 = compass_df_table[['CountryLabel', 'SurveyYear',topic]].copy()
selected = ['Australia', 'United Kingdom', 'Brazil', 'Germany', 'France', 'Canada', 'Italy', 'Korea', 'OECD']
compass_df_table_graph2 = compass_df_table_graph2.loc[compass_df_table_graph2['CountryLabel'].isin(selected)]
compass_df_table_graph2["SurveyYear"] = compass_df_table_graph2["SurveyYear"].astype(int)
compass_df_table_graph2 = pd.pivot_table(data=compass_df_table_graph2, index=['CountryLabel'], columns=['SurveyYear'], values=[topic], aggfunc={topic: pd.Series.mean}, fill_value=0).reset_index()
compass_df_table_graph2 = compass_df_table_graph2.rename(columns = {"" : "YearlyBudgetRange"})
compass_df_table_graph2.columns = compass_df_table_graph2.columns.droplevel(-2)
compass_df_table_graph2 = compass_df_table_graph2.T.reset_index()
new_header = compass_df_table_graph2.iloc[0]
compass_df_table_graph2 = compass_df_table_graph2[1:]
compass_df_table_graph2.columns = new_header
for x in range(1, compass_df_table_graph2.shape[1]):
compass_df_table_graph2.iloc[:,x] = compass_df_table_graph2.iloc[:,x] / compass_df_table_graph2["OECD"]
compass_df_table_graph2 = compass_df_table_graph2.T.reset_index()
new_header = compass_df_table_graph2.iloc[0]
compass_df_table_graph2 = compass_df_table_graph2[1:]
compass_df_table_graph2.columns = new_header
compass_df_table_graph2 = compass_df_table_graph2.rename(columns = {"YearlyBudgetRange" : "CountryLabel"})
compass_df_table_graph2["OECD"] = 1.0
compass_df_table_graph2 = compass_df_table_graph2[compass_df_table_graph2.CountryLabel != "OECD"]
# order data by 2021 score
compass_df_table_graph2.sort_values(by=[2021], inplace=True, ascending=True)
# Generate the graph:
import plotly.graph_objects as go
country = compass_df_table_graph2['CountryLabel']
Ratio17 = compass_df_table_graph2[2017]
Ratio19 = compass_df_table_graph2[2019]
Ratio21 = compass_df_table_graph2[2021]
OECD = compass_df_table_graph2['OECD']
colors = {'Null': '#ffffff','A':'#594d45','B':'#79533d','C':'#595142','D':'#745d46','E':'#ac7e54','F':'#a9947b','G':'#b49d80','H':'#bcab90','I':'#535640','J':'#54504b','K':'#63613e','L':'#4a5444','M':'#5c5c5b','N':'#9495a5','O':'#373538','Z':'#000000'}
fig = go.Figure()
fig.add_trace(go.Scatter(
x=Ratio17,
y=country,
name='2017 Survey',
marker_color=colors['G'],
marker=dict(
size=13,
cmax=1,
cmin=0,
line_color="#000000",
)
))
#fig.add_trace(go.Scatter(
# x=Ratio19,
# y=country,
# name='2019 Survey',
# marker_color=colors['D'],
# marker=dict(
# size=13,
# cmax=1,
# cmin=0,
# line_color="#000000",
# )
#))
fig.add_trace(go.Scatter(
x=Ratio21,
y=country,
name='2021 Survey',
marker_color=colors['A'],
marker=dict(
size=13,
cmax=1,
cmin=0,
line_color="#000000",
)
))
fig.update_traces(mode='markers', marker=dict(line_width=1, symbol='circle', size=13))
fig.update_layout(
title={
'text': "Figure 8: Prevalence of topic in countries’ declared objectives vs OECD Median",
'y':0.9,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'},
title_font_color="#000000",
title_font_family="verdana",
xaxis=dict(
showgrid=False,
showline=True,
linecolor="#000000",
tickfont_color="#000000",
showticklabels=True,
dtick=0.5,
tickformat=".0%",
ticks='outside',
tickcolor="#000000",
),
margin=dict(l=140, r=40, b=50, t=100),
legend=dict(
font_size=10,
y=0.5,
#yanchor='middle',
xanchor='left',
font=dict(color="black"),
),
width=1000,
height=400,
paper_bgcolor='white',
plot_bgcolor='white',
hovermode='closest',
)
fig.update_yaxes(
tickfont=dict(family="verdana", color="#000000", size=13)
)
fig.update_xaxes(
title_text=topic_title1,
title_font_family="verdana",
title_font_size=17,
title_font_color="#000000",
fixedrange=True,
range=(-0.01,2.04),
ticktext=["","Topic less<br>prevalent", "OECD<br>Median", "Topic more<br>prevalent"],
tickvals=[0, 0.5, 1, 1.5],
showgrid=True,
gridcolor="#444",
# gridfont=dict(family="Open Sans"),
ticks="outside",
tickson="boundaries",
tickcolor="#444",
tickfont=dict(family="verdana", color="#000000", size=13)
)
# save and display the graph
#fig.write_html(r"C:\Users\Gildas\OneDrive\OECD\Presentation_graphs\fig7.html")
fig.write_image(r"C:\Users\Gildas\OneDrive\OECD\Presentation_graphs\fig7.png")
fig.show()
Canada, the United Kingdom and Australia stand out for emphasising stimulating demand for innovation in their policy initiatives. Emphasis has increased in the policies reported by Canada and Korea (high prevalence) and, to a lesser extent, Brazil (close to the OECD median). Topic prevalence is low in Italy.
### create a graph:
topic = 'Topic12'
topic_title1 = topic_title.loc[topic_title['SurveyYear'] == topic]
topic_title1 = topic_title1.reset_index(drop=True, inplace=False)
topic_title1 = topic_title1['title'][0]
compass_df_table_graph2 = compass_df_table[['CountryLabel', 'SurveyYear',topic]].copy()
selected = ['Australia', 'United Kingdom', 'Brazil', 'Germany', 'France', 'Canada', 'Italy', 'Korea', 'OECD']
compass_df_table_graph2 = compass_df_table_graph2.loc[compass_df_table_graph2['CountryLabel'].isin(selected)]
compass_df_table_graph2["SurveyYear"] = compass_df_table_graph2["SurveyYear"].astype(int)
compass_df_table_graph2 = pd.pivot_table(data=compass_df_table_graph2, index=['CountryLabel'], columns=['SurveyYear'], values=[topic], aggfunc={topic: pd.Series.mean}, fill_value=0).reset_index()
compass_df_table_graph2 = compass_df_table_graph2.rename(columns = {"" : "YearlyBudgetRange"})
compass_df_table_graph2.columns = compass_df_table_graph2.columns.droplevel(-2)
compass_df_table_graph2 = compass_df_table_graph2.T.reset_index()
new_header = compass_df_table_graph2.iloc[0]
compass_df_table_graph2 = compass_df_table_graph2[1:]
compass_df_table_graph2.columns = new_header
for x in range(1, compass_df_table_graph2.shape[1]):
compass_df_table_graph2.iloc[:,x] = compass_df_table_graph2.iloc[:,x] / compass_df_table_graph2["OECD"]
compass_df_table_graph2 = compass_df_table_graph2.T.reset_index()
new_header = compass_df_table_graph2.iloc[0]
compass_df_table_graph2 = compass_df_table_graph2[1:]
compass_df_table_graph2.columns = new_header
compass_df_table_graph2 = compass_df_table_graph2.rename(columns = {"YearlyBudgetRange" : "CountryLabel"})
compass_df_table_graph2["OECD"] = 1.0
compass_df_table_graph2 = compass_df_table_graph2[compass_df_table_graph2.CountryLabel != "OECD"]
# order data by 2021 score
compass_df_table_graph2.sort_values(by=[2021], inplace=True, ascending=True)
# Generate the graph:
import plotly.graph_objects as go
country = compass_df_table_graph2['CountryLabel']
Ratio17 = compass_df_table_graph2[2017]
Ratio19 = compass_df_table_graph2[2019]
Ratio21 = compass_df_table_graph2[2021]
OECD = compass_df_table_graph2['OECD']
colors = {'Null': '#ffffff','A':'#594d45','B':'#79533d','C':'#595142','D':'#745d46','E':'#ac7e54','F':'#a9947b','G':'#b49d80','H':'#bcab90','I':'#535640','J':'#54504b','K':'#63613e','L':'#4a5444','M':'#5c5c5b','N':'#9495a5','O':'#373538','Z':'#000000'}
fig = go.Figure()
fig.add_trace(go.Scatter(
x=Ratio17,
y=country,
name='2017 Survey',
marker_color=colors['G'],
marker=dict(
size=13,
cmax=1,
cmin=0,
line_color="#000000",
)
))
#fig.add_trace(go.Scatter(
# x=Ratio19,
# y=country,
# name='2019 Survey',
# marker_color=colors['D'],
# marker=dict(
# size=13,
# cmax=1,
# cmin=0,
# line_color="#000000",
# )
#))
fig.add_trace(go.Scatter(
x=Ratio21,
y=country,
name='2021 Survey',
marker_color=colors['A'],
marker=dict(
size=13,
cmax=1,
cmin=0,
line_color="#000000",
)
))
fig.update_traces(mode='markers', marker=dict(line_width=1, symbol='circle', size=13))
fig.update_layout(
title={
'text': "Figure 9: Prevalence of topic in countries’ declared objectives vs OECD Median",
'y':0.9,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'},
title_font_color="#000000",
title_font_family="verdana",
xaxis=dict(
showgrid=False,
showline=True,
linecolor="#000000",
tickfont_color="#000000",
showticklabels=True,
dtick=0.5,
tickformat=".0%",
ticks='outside',
tickcolor="#000000",
),
margin=dict(l=140, r=40, b=50, t=100),
legend=dict(
font_size=10,
y=0.5,
#yanchor='middle',
xanchor='left',
font=dict(color="black"),
),
width=1000,
height=400,
paper_bgcolor='white',
plot_bgcolor='white',
hovermode='closest',
)
fig.update_yaxes(
tickfont=dict(family="verdana", color="#000000", size=13)
)
fig.update_xaxes(
title_text=topic_title1,
title_font_family="verdana",
title_font_size=17,
title_font_color="#000000",
fixedrange=True,
range=(-0.01, 3.25),
ticktext=["","Topic less<br>prevalent", "OECD<br>Median", "Topic more<br>prevalent"],
tickvals=[0, 0.5, 1, 1.5],
showgrid=True,
gridcolor="#444",
# gridfont=dict(family="Open Sans"),
ticks="outside",
tickson="boundaries",
tickcolor="#444",
tickfont=dict(family="verdana", color="#000000", size=13)
)
# save and display the graph
#fig.write_html(r"C:\Users\Gildas\OneDrive\OECD\Presentation_graphs\fig8.html")
fig.write_image(r"C:\Users\Gildas\OneDrive\OECD\Presentation_graphs\fig8.png")
fig.show()
Italy, France and, to a lesser extent, the United Kingdom report policies that stand out for emphasising this topic most frequently relative to other countries. These countries increased their emphasis in the 2021 edition of the dataset. Topic prevalence is close to the OECD median in Korea and Australia. Among the selected countries, the topic is least prevalent in Germany, Brazil and Canada.
### create a graph:
topic = 'Topic7'
topic_title1 = topic_title.loc[topic_title['SurveyYear'] == topic]
topic_title1 = topic_title1.reset_index(drop=True, inplace=False)
topic_title1 = topic_title1['title'][0]
compass_df_table_graph2 = compass_df_table[['CountryLabel', 'SurveyYear',topic]].copy()
selected = ['Australia', 'United Kingdom', 'Brazil', 'Germany', 'France', 'Canada', 'Italy', 'Korea', 'OECD']
compass_df_table_graph2 = compass_df_table_graph2.loc[compass_df_table_graph2['CountryLabel'].isin(selected)]
compass_df_table_graph2["SurveyYear"] = compass_df_table_graph2["SurveyYear"].astype(int)
compass_df_table_graph2 = pd.pivot_table(data=compass_df_table_graph2, index=['CountryLabel'], columns=['SurveyYear'], values=[topic], aggfunc={topic: pd.Series.mean}, fill_value=0).reset_index()
compass_df_table_graph2 = compass_df_table_graph2.rename(columns = {"" : "YearlyBudgetRange"})
compass_df_table_graph2.columns = compass_df_table_graph2.columns.droplevel(-2)
compass_df_table_graph2 = compass_df_table_graph2.T.reset_index()
new_header = compass_df_table_graph2.iloc[0]
compass_df_table_graph2 = compass_df_table_graph2[1:]
compass_df_table_graph2.columns = new_header
for x in range(1, compass_df_table_graph2.shape[1]):
compass_df_table_graph2.iloc[:,x] = compass_df_table_graph2.iloc[:,x] / compass_df_table_graph2["OECD"]
compass_df_table_graph2 = compass_df_table_graph2.T.reset_index()
new_header = compass_df_table_graph2.iloc[0]
compass_df_table_graph2 = compass_df_table_graph2[1:]
compass_df_table_graph2.columns = new_header
compass_df_table_graph2 = compass_df_table_graph2.rename(columns = {"YearlyBudgetRange" : "CountryLabel"})
compass_df_table_graph2["OECD"] = 1.0
compass_df_table_graph2 = compass_df_table_graph2[compass_df_table_graph2.CountryLabel != "OECD"]
# order data by 2021 score
compass_df_table_graph2.sort_values(by=[2021], inplace=True, ascending=True)
# Generate the graph:
import plotly.graph_objects as go
country = compass_df_table_graph2['CountryLabel']
Ratio17 = compass_df_table_graph2[2017]
Ratio19 = compass_df_table_graph2[2019]
Ratio21 = compass_df_table_graph2[2021]
OECD = compass_df_table_graph2['OECD']
colors = {'Null': '#ffffff','A':'#594d45','B':'#79533d','C':'#595142','D':'#745d46','E':'#ac7e54','F':'#a9947b','G':'#b49d80','H':'#bcab90','I':'#535640','J':'#54504b','K':'#63613e','L':'#4a5444','M':'#5c5c5b','N':'#9495a5','O':'#373538','Z':'#000000'}
fig = go.Figure()
fig.add_trace(go.Scatter(
x=Ratio17,
y=country,
name='2017 Survey',
marker_color=colors['G'],
marker=dict(
size=13,
cmax=1,
cmin=0,
line_color="#000000",
)
))
#fig.add_trace(go.Scatter(
# x=Ratio19,
# y=country,
# name='2019 Survey',
# marker_color=colors['D'],
# marker=dict(
# size=13,
# cmax=1,
# cmin=0,
# line_color="#000000",
# )
#))
fig.add_trace(go.Scatter(
x=Ratio21,
y=country,
name='2021 Survey',
marker_color=colors['A'],
marker=dict(
size=13,
cmax=1,
cmin=0,
line_color="#000000",
)
))
fig.update_traces(mode='markers', marker=dict(line_width=1, symbol='circle', size=13))
fig.update_layout(
title={
'text': "Figure 10: Prevalence of topic in countries’ declared objectives vs OECD Median",
'y':0.9,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'},
title_font_color="#000000",
title_font_family="verdana",
xaxis=dict(
showgrid=False,
showline=True,
linecolor="#000000",
tickfont_color="#000000",
showticklabels=True,
dtick=0.5,
tickformat=".0%",
ticks='outside',
tickcolor="#000000",
),
margin=dict(l=140, r=40, b=50, t=100),
legend=dict(
font_size=10,
y=0.5,
#yanchor='middle',
xanchor='left',
font=dict(color="black"),
),
width=1000,
height=400,
paper_bgcolor='white',
plot_bgcolor='white',
hovermode='closest',
)
fig.update_yaxes(
tickfont=dict(family="verdana", color="#000000", size=13)
)
fig.update_xaxes(
title_text=topic_title1,
title_font_family="verdana",
title_font_size=17,
title_font_color="#000000",
fixedrange=True,
range=(-0.01, 3),
ticktext=["","Topic less<br>prevalent", "OECD<br>Median", "Topic more<br>prevalent"],
tickvals=[0, 0.5, 1, 1.5],
showgrid=True,
gridcolor="#444",
# gridfont=dict(family="Open Sans"),
ticks="outside",
tickson="boundaries",
tickcolor="#444",
tickfont=dict(family="verdana", color="#000000", size=13)
)
# save and display the graph
#fig.write_html(r"C:\Users\Gildas\OneDrive\OECD\Presentation_graphs\fig9.html")
fig.write_image(r"C:\Users\Gildas\OneDrive\OECD\Presentation_graphs\fig9.png")
fig.show()
Australian and Korean policies stand out for emphasising industrial technology development in a much larger proportion than the OECD median. In Italy and France the topic is less prevalent, particularly in the 2021 dataset. The other countries remain stable around the OECD median.
### create a graph:
topic = 'Topic17'
topic_title1 = topic_title.loc[topic_title['SurveyYear'] == topic]
topic_title1 = topic_title1.reset_index(drop=True, inplace=False)
topic_title1 = topic_title1['title'][0]
compass_df_table_graph2 = compass_df_table[['CountryLabel', 'SurveyYear',topic]].copy()
selected = ['Australia', 'United Kingdom', 'Brazil', 'Germany', 'France', 'Canada', 'Italy', 'Korea', 'OECD']
compass_df_table_graph2 = compass_df_table_graph2.loc[compass_df_table_graph2['CountryLabel'].isin(selected)]
compass_df_table_graph2["SurveyYear"] = compass_df_table_graph2["SurveyYear"].astype(int)
compass_df_table_graph2 = pd.pivot_table(data=compass_df_table_graph2, index=['CountryLabel'], columns=['SurveyYear'], values=[topic], aggfunc={topic: pd.Series.mean}, fill_value=0).reset_index()
compass_df_table_graph2 = compass_df_table_graph2.rename(columns = {"" : "YearlyBudgetRange"})
compass_df_table_graph2.columns = compass_df_table_graph2.columns.droplevel(-2)
compass_df_table_graph2 = compass_df_table_graph2.T.reset_index()
new_header = compass_df_table_graph2.iloc[0]
compass_df_table_graph2 = compass_df_table_graph2[1:]
compass_df_table_graph2.columns = new_header
for x in range(1, compass_df_table_graph2.shape[1]):
compass_df_table_graph2.iloc[:,x] = compass_df_table_graph2.iloc[:,x] / compass_df_table_graph2["OECD"]
compass_df_table_graph2 = compass_df_table_graph2.T.reset_index()
new_header = compass_df_table_graph2.iloc[0]
compass_df_table_graph2 = compass_df_table_graph2[1:]
compass_df_table_graph2.columns = new_header
compass_df_table_graph2 = compass_df_table_graph2.rename(columns = {"YearlyBudgetRange" : "CountryLabel"})
compass_df_table_graph2["OECD"] = 1.0
compass_df_table_graph2 = compass_df_table_graph2[compass_df_table_graph2.CountryLabel != "OECD"]
# order data by 2021 score
compass_df_table_graph2.sort_values(by=[2021], inplace=True, ascending=True)
# Generate the graph:
import plotly.graph_objects as go
country = compass_df_table_graph2['CountryLabel']
Ratio17 = compass_df_table_graph2[2017]
Ratio19 = compass_df_table_graph2[2019]
Ratio21 = compass_df_table_graph2[2021]
OECD = compass_df_table_graph2['OECD']
colors = {'Null': '#ffffff','A':'#594d45','B':'#79533d','C':'#595142','D':'#745d46','E':'#ac7e54','F':'#a9947b','G':'#b49d80','H':'#bcab90','I':'#535640','J':'#54504b','K':'#63613e','L':'#4a5444','M':'#5c5c5b','N':'#9495a5','O':'#373538','Z':'#000000'}
fig = go.Figure()
fig.add_trace(go.Scatter(
x=Ratio17,
y=country,
name='2017 Survey',
marker_color=colors['G'],
marker=dict(
size=13,
cmax=1,
cmin=0,
line_color="#000000",
)
))
#fig.add_trace(go.Scatter(
# x=Ratio19,
# y=country,
# name='2019 Survey',
# marker_color=colors['D'],
# marker=dict(
# size=13,
# cmax=1,
# cmin=0,
# line_color="#000000",
# )
#))
fig.add_trace(go.Scatter(
x=Ratio21,
y=country,
name='2021 Survey',
marker_color=colors['A'],
marker=dict(
size=13,
cmax=1,
cmin=0,
line_color="#000000",
)
))
fig.update_traces(mode='markers', marker=dict(line_width=1, symbol='circle', size=13))
fig.update_layout(
title={
'text': "Figure 11: Prevalence of topic in countries’ declared objectives vs OECD Median",
'y':0.9,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'},
title_font_color="#000000",
title_font_family="verdana",
xaxis=dict(
showgrid=False,
showline=True,
linecolor="#000000",
tickfont_color="#000000",
showticklabels=True,
dtick=0.5,
tickformat=".0%",
ticks='outside',
tickcolor="#000000",
),
margin=dict(l=140, r=40, b=50, t=100),
legend=dict(
font_size=10,
y=0.5,
#yanchor='middle',
xanchor='left',
font=dict(color="black"),
),
width=1000,
height=400,
paper_bgcolor='white',
plot_bgcolor='white',
hovermode='closest',
)
fig.update_yaxes(
tickfont=dict(family="verdana", color="#000000", size=13)
)
fig.update_xaxes(
title_text=topic_title1,
title_font_family="verdana",
title_font_size=17,
title_font_color="#000000",
fixedrange=True,
range=(-0.01, 3.5),
ticktext=["","Topic less<br>prevalent", "OECD<br>Median", "Topic more<br>prevalent"],
tickvals=[0, 0.5, 1, 1.5],
showgrid=True,
gridcolor="#444",
# gridfont=dict(family="Open Sans"),
ticks="outside",
tickson="boundaries",
tickcolor="#444",
tickfont=dict(family="verdana", color="#000000", size=13)
)
# save and display the graph
#fig.write_html(r"C:\Users\Gildas\OneDrive\OECD\Presentation_graphs\fig10.html")
fig.write_image(r"C:\Users\Gildas\OneDrive\OECD\Presentation_graphs\fig10.png")
fig.show()
This topic accounts for a smaller share of the corpus than the others, but its prevalence has increased with each survey edition. Societal challenges have become very prevalent in the policies reported by Canada, Korea and Germany. Prevalence of this topic has also grown in the United Kingdom, and has remained higher than the OECD median in France. Compared to 2017, the policies reported by Brazil and Italy increasingly emphasise societal challenges, though they remain below the OECD median.
### create a graph:
topic = 'Topic6'
topic_title1 = topic_title.loc[topic_title['SurveyYear'] == topic]
topic_title1 = topic_title1.reset_index(drop=True, inplace=False)
topic_title1 = topic_title1['title'][0]
compass_df_table_graph2 = compass_df_table[['CountryLabel', 'SurveyYear',topic]].copy()
selected = ['Australia', 'United Kingdom', 'Brazil', 'Germany', 'France', 'Canada', 'Italy', 'Korea', 'OECD']
compass_df_table_graph2 = compass_df_table_graph2.loc[compass_df_table_graph2['CountryLabel'].isin(selected)]
compass_df_table_graph2["SurveyYear"] = compass_df_table_graph2["SurveyYear"].astype(int)
compass_df_table_graph2 = pd.pivot_table(data=compass_df_table_graph2, index=['CountryLabel'], columns=['SurveyYear'], values=[topic], aggfunc={topic: pd.Series.mean}, fill_value=0).reset_index()
compass_df_table_graph2 = compass_df_table_graph2.rename(columns = {"" : "YearlyBudgetRange"})
compass_df_table_graph2.columns = compass_df_table_graph2.columns.droplevel(-2)
compass_df_table_graph2 = compass_df_table_graph2.T.reset_index()
new_header = compass_df_table_graph2.iloc[0]
compass_df_table_graph2 = compass_df_table_graph2[1:]
compass_df_table_graph2.columns = new_header
for x in range(1, compass_df_table_graph2.shape[1]):
compass_df_table_graph2.iloc[:,x] = compass_df_table_graph2.iloc[:,x] / compass_df_table_graph2["OECD"]
compass_df_table_graph2 = compass_df_table_graph2.T.reset_index()
new_header = compass_df_table_graph2.iloc[0]
compass_df_table_graph2 = compass_df_table_graph2[1:]
compass_df_table_graph2.columns = new_header
compass_df_table_graph2 = compass_df_table_graph2.rename(columns = {"YearlyBudgetRange" : "CountryLabel"})
compass_df_table_graph2["OECD"] = 1.0
compass_df_table_graph2 = compass_df_table_graph2[compass_df_table_graph2.CountryLabel != "OECD"]
# order data by 2021 score
compass_df_table_graph2.sort_values(by=[2021], inplace=True, ascending=True)
# Generate the graph:
import plotly.graph_objects as go
country = compass_df_table_graph2['CountryLabel']
Ratio17 = compass_df_table_graph2[2017]
Ratio19 = compass_df_table_graph2[2019]
Ratio21 = compass_df_table_graph2[2021]
OECD = compass_df_table_graph2['OECD']
colors = {'Null': '#ffffff','A':'#594d45','B':'#79533d','C':'#595142','D':'#745d46','E':'#ac7e54','F':'#a9947b','G':'#b49d80','H':'#bcab90','I':'#535640','J':'#54504b','K':'#63613e','L':'#4a5444','M':'#5c5c5b','N':'#9495a5','O':'#373538','Z':'#000000'}
fig = go.Figure()
fig.add_trace(go.Scatter(
x=Ratio17,
y=country,
name='2017 Survey',
marker_color=colors['G'],
marker=dict(
size=13,
cmax=1,
cmin=0,
line_color="#000000",
)
))
#fig.add_trace(go.Scatter(
# x=Ratio19,
# y=country,
# name='2019 Survey',
# marker_color=colors['D'],
# marker=dict(
# size=13,
# cmax=1,
# cmin=0,
# line_color="#000000",
# )
#))
fig.add_trace(go.Scatter(
x=Ratio21,
y=country,
name='2021 Survey',
marker_color=colors['A'],
marker=dict(
size=13,
cmax=1,
cmin=0,
line_color="#000000",
)
))
fig.update_traces(mode='markers', marker=dict(line_width=1, symbol='circle', size=13))
fig.update_layout(
title={
'text': "Figure 12: Prevalence of topic in countries’ declared objectives vs OECD Median",
'y':0.9,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'},
title_font_color="#000000",
title_font_family="verdana",
xaxis=dict(
showgrid=False,
showline=True,
linecolor="#000000",
tickfont_color="#000000",
showticklabels=True,
dtick=0.5,
tickformat=".0%",
ticks='outside',
tickcolor="#000000",
),
margin=dict(l=140, r=40, b=50, t=100),
legend=dict(
font_size=10,
y=0.5,
#yanchor='middle',
xanchor='left',
font=dict(color="black"),
),
width=1000,
height=400,
paper_bgcolor='white',
plot_bgcolor='white',
hovermode='closest',
)
fig.update_yaxes(
tickfont=dict(family="verdana", color="#000000", size=13)
)
fig.update_xaxes(
title_text=topic_title1,
title_font_family="verdana",
title_font_size=17,
title_font_color="#000000",
fixedrange=True,
range=(-0.01, 3.5),
ticktext=["","Topic less<br>prevalent", "OECD<br>Median", "Topic more<br>prevalent"],
tickvals=[0, 0.5, 1, 1.5],
showgrid=True,
gridcolor="#444",
# gridfont=dict(family="Open Sans"),
ticks="outside",
tickson="boundaries",
tickcolor="#444",
tickfont=dict(family="verdana", color="#000000", size=13)
)
# save and display the graph
#fig.write_html(r"C:\Users\Gildas\OneDrive\OECD\Presentation_graphs\fig11.html")
fig.write_image(r"C:\Users\Gildas\OneDrive\OECD\Presentation_graphs\fig11.png")
fig.show()
All of the countries included in this analysis address each of the seven topics.
This section highlights the difference between using semantic analysis and counting the number of initiatives to assess the prevalence of a topic.
As shown in the table below, there is some overlap between the topics identified by the semantic analysis and the STIP survey themes (questions). However, most of the topics emerging from the analysis do not correspond to predefined themes in the survey. In other words, the semantic analysis identifies patterns emerging from the data that go beyond the survey structure.
| Topic (semantic analysis) | Policy theme (2021 STIP survey) |
|---|---|
| 1. Leveraging the research system | N.A. |
| 2. Access to finance | Access to finance |
| 3. Entrepreneurship and business support services | Partial overlap with: “Entrepreneurship capabilities and culture”, “Targeted support to young innovative enterprises” and “Non-financial support for business R&D and innovation” |
| 4. Stimulating demand for innovation | Stimulating demand for innovation and market creation |
| 5. Tax incentives for R&D and innovation | N.A. |
| 6. Industrial technology development | N.A. |
| 7. Tackling societal challenges | N.A. |
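The two prevalence measures can diverge: a country may tick a survey theme often while the free-text descriptions of its initiatives dwell on the topic comparatively little, and vice versa. A toy sketch (all country names and numbers below are hypothetical) makes the contrast concrete:

```python
import pandas as pd

# Hypothetical initiative-level data: a binary survey-theme flag vs an
# NLP-derived topic weight for each initiative's free-text description.
df = pd.DataFrame({
    "Country": ["A", "A", "B", "B", "B"],
    "ThemeFlag": [1, 1, 1, 0, 0],                    # theme ticked in the survey
    "TopicWeight": [0.10, 0.15, 0.60, 0.55, 0.50],   # share of text on the topic
})

# Measure 1: count of initiatives flagged under the survey theme
counts = df.groupby("Country")["ThemeFlag"].sum()

# Measure 2: mean topic weight from the semantic analysis
weights = df.groupby("Country")["TopicWeight"].mean()

print(counts.to_dict())   # country A flags the theme more often...
print(weights.to_dict())  # ...but country B's texts dwell on the topic more
```

This is why the count-based view in Figure 13 can rank countries differently from the topic-prevalence view in Figure 7.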
Figure 13 below provides the number of initiatives per country flagged under the STIP theme "Access to finance", relative to the OECD median, in the 2021 survey. Countries are ordered by the 2021 prevalence of the topic "Access to finance", as visualised in Figure 7.
# retrieve the data for theme TH38 ("Access to finance")
compass_df10 = compass_df_full.copy()
compass_df10 = compass_df10[compass_df10["SurveyYear"] == 2021]
compass_df10 = compass_df10[compass_df10["TH38"] != 0]
compass_df10 = pd.pivot_table(data=compass_df10, index=['SurveyYear', 'OECD_STATUS', 'CountryLabel','InitiativeID'], values=['TH38'], aggfunc={'TH38': pd.Series.sum}, fill_value=0).reset_index()
compass_df10 = compass_df10.drop_duplicates().reset_index(inplace = False).copy()
#compass_df10
compass_df10 = pd.pivot_table(data=compass_df10, index=['SurveyYear', 'OECD_STATUS', 'CountryLabel','InitiativeID'], values=['TH38'], aggfunc={'TH38': pd.Series.count}, fill_value=0).reset_index()
compass_df10 = compass_df10.drop_duplicates().reset_index(inplace = False).copy()
compass_df10 = pd.pivot_table(data=compass_df10, index=['SurveyYear', 'OECD_STATUS', 'CountryLabel'], values=['TH38'], aggfunc={'TH38': pd.Series.sum}, fill_value=0).reset_index()
# compute OECD values
compass_df10_OCDE = compass_df10[compass_df10["OECD_STATUS"] == "OECD country"]
compass_df10_OCDE = pd.pivot_table(data=compass_df10_OCDE, index=['SurveyYear'], values=['TH38'], aggfunc={'TH38': pd.Series.median}, fill_value=0).reset_index()
compass_df10_OCDE = compass_df10_OCDE.rename(columns = {"TH38" : "OECD"})
compass_df10 = pd.merge(compass_df10,
compass_df10_OCDE,
on ='SurveyYear',
how ='inner')
# normalise by OECD score
compass_df10['TH38'] = compass_df10['TH38'] / compass_df10['OECD']
compass_df10 = compass_df10.drop(['OECD'], axis=1)
compass_df10 = pd.merge(compass_df10,
my_countries['CountryLabel'],
on ='CountryLabel',
how ='inner')
# order data by Topic score
Order1 = Order.reset_index(inplace = False).copy()
Order1 = Order1.rename(columns = {"Topic4" : "CountryLabel","index": "Order"})
columns_list = ['CountryLabel','Order']
#Order=Order.reindex(columns= columns_list)
compass_df10 = pd.merge(compass_df10,
Order1,
on ='CountryLabel',
how ='inner')
compass_df10.sort_values(by=['Order'], inplace=True, ascending=True)
# Generate the graph:
import plotly.graph_objects as go
country = compass_df10['CountryLabel']
Ratio21 = compass_df10['TH38']
colors = {'Null': '#ffffff','A':'#594d45','B':'#79533d','C':'#595142','D':'#745d46','E':'#ac7e54','F':'#a9947b','G':'#b49d80','H':'#bcab90','I':'#535640','J':'#54504b','K':'#63613e','L':'#4a5444','M':'#5c5c5b','N':'#9495a5','O':'#373538','Z':'#000000'}
fig = go.Figure()
fig.add_trace(go.Scatter(
x=Ratio21,
y=country,
name='2021 Survey',
marker_color=colors['A'],
marker=dict(
size=13,
cmax=1,
cmin=0,
line_color="#000000",
)
))
fig.update_traces(mode='markers', marker=dict(line_width=1, symbol='circle', size=13))
fig.update_layout(
title={
'text': "Figure 13: Count of initiatives flagged under 'Access to finance for innovation'",
'y':0.9,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'},
title_font_color="#000000",
title_font_family="verdana",
xaxis=dict(
showgrid=False,
showline=True,
linecolor="#000000",
tickfont_color="#000000",
showticklabels=True,
dtick=0.5,
tickformat=".0%",
ticks='outside',
tickcolor="#000000",
),
margin=dict(l=140, r=40, b=50, t=100),
legend=dict(
font_size=10,
y=0.5,
#yanchor='middle',
xanchor='left',
font=dict(color="black"),
),
width=1000,
height=400,
paper_bgcolor='white',
plot_bgcolor='white',
hovermode='closest',
)
fig.update_yaxes(
tickfont=dict(family="verdana", color="#000000", size=13)
)
fig.update_xaxes(
title_text="",
title_font_family="verdana",
title_font_size=17,
title_font_color="#000000",
fixedrange=True,
range=(-0.01, 4),
ticktext=["","Topic less<br>prevalent", "OECD<br>Median", "Topic more<br>prevalent"],
tickvals=[0, 0.5, 1, 1.5],
showgrid=True,
gridcolor="#444",
# gridfont=dict(family="Open Sans"),
ticks="outside",
tickson="boundaries",
tickcolor="#444",
tickfont=dict(family="verdana", color="#000000", size=13)
)
#print the graph
#fig.write_html(r"C:\Users\Gildas\OneDrive\OECD\Presentation_graphs\fig11.html")
fig.write_image(r"C:\Users\Gildas\OneDrive\OECD\Presentation_graphs\fig11.png")
fig.show()
When comparing Figure 13 with Figure 7, obtained from the semantic analysis, the results are very different. The semantic analysis measures how much a topic is raised across all initiatives reported in the business innovation policy area, whereas the initiative count only captures initiatives explicitly tagged under the corresponding theme (question in the survey). Initiatives often deal with access to finance but are primarily tagged under other themes of the survey policy area (e.g. targeted support for SMEs, national strategies). As a result, a count by policy theme does not capture how strongly all reported policies emphasise a given topic in their text.
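The distinction can be illustrated with a toy example (hypothetical data and column names, not the STIP schema): an initiative tagged under another theme still contributes to the text-based prevalence of the "Access to finance" topic, but not to the tag-based count.

```python
import pandas as pd

# Hypothetical mini-dataset: two initiatives, only one explicitly tagged
# under the "Access to finance" theme, but both discussing finance in their
# free-text objectives (topic weights from a hypothetical topic model).
initiatives = pd.DataFrame({
    "InitiativeID": [1, 2],
    "theme_tag": ["Access to finance", "Targeted support to SMEs"],
    "finance_topic_weight": [0.8, 0.5],
})

# Tag-based measure: counts only explicitly tagged initiatives
tag_count = int((initiatives["theme_tag"] == "Access to finance").sum())

# Text-based measure: average topic weight across all reported initiatives
topic_prevalence = initiatives["finance_topic_weight"].mean()
```

Here the tag-based count is 1 while the text-based prevalence reflects both initiatives, mirroring the gap between Figure 13 and Figure 7.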
This section visualises how each country emphasises the seven topics identified by the analysis, relative to the OECD median.
In the charts, the label "Topic less prevalent" marks topics that are 50% less frequent than the OECD median; similarly, "Topic more prevalent" marks topics that are 50% more frequent than the OECD median.
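A minimal sketch of this normalisation (with illustrative column names, not the notebook's exact ones): each country's topic frequency is divided by the OECD median, so a ratio of 0.5 sits at the "Topic less prevalent" gridline and 1.5 at "Topic more prevalent".

```python
import pandas as pd

# Illustrative topic frequencies for three hypothetical countries
freq = pd.DataFrame({
    "CountryLabel": ["A", "B", "C"],
    "topic_freq": [0.10, 0.20, 0.40],
})

# Normalise by the OECD median so 1.0 means "at the median"
oecd_median = freq["topic_freq"].median()
freq["ratio"] = freq["topic_freq"] / oecd_median

def chart_label(r):
    # Thresholds matching the charts' tick labels
    if r <= 0.5:
        return "Topic less prevalent"
    if r >= 1.5:
        return "Topic more prevalent"
    return "Near OECD median"

freq["label"] = freq["ratio"].apply(chart_label)
```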
57% of the text data reported in 2021 Australian policies can be associated with the seven topics identified by the model, down from 63% in 2017. Australia’s topic profile is characterised by a very high emphasis placed on initiatives targeting industrial technology development and stimulating demand for innovation. Otherwise, most of Australia’s topic frequencies in 2021 are rather stable and closely follow the OECD median. Relative to other topics, leveraging the RDI ecosystem is the least frequent.
### select data for selected country
compass_df_table_graph1 = compass_df_table.loc[(compass_df_table['CountryLabel'] == 'Australia') | (compass_df_table['CountryLabel'] == 'OECD')]
compass_df_table_graph1 = compass_df_table_graph1.T.reset_index()
new_header = compass_df_table_graph1.iloc[0]
compass_df_table_graph1 = compass_df_table_graph1[2:]
compass_df_table_graph1.columns = new_header
compass_df_table_graph1 = compass_df_table_graph1.astype(float, errors='ignore')
for x in range(1, 4):
    compass_df_table_graph1.iloc[:,x] = compass_df_table_graph1.iloc[:,x] / compass_df_table_graph1.iloc[:,x+3]
compass_df_table_graph1 = compass_df_table_graph1.iloc[: , :-3]
compass_df_table_graph1["OECD"] = 1
compass_df_table_graph1 = compass_df_table_graph1.reset_index()
compass_df_table_graph1.sort_values(by=['index'], inplace=True, ascending=False)
compass_df_table_graph1 = compass_df_table_graph1.drop(['index'], axis=1)
# technical change: replace exact zeros with a small non-zero value so the marker still renders at 0%
# (use a single .iloc[row, col] assignment rather than chained indexing, which may silently fail)
for y in range(0, 6):
    for x in range(0, 5):
        if compass_df_table_graph1.iloc[y, x] == 0:
            compass_df_table_graph1.iloc[y, x] = 0.00275
# add title to dataframe
compass_df_table_graph1 = compass_df_table_graph1.sort_index()
compass_df_table_graph1 = pd.merge(compass_df_table_graph1,
topic_title,
on =['SurveyYear'],
how ='inner')
# order data by 2021 score
compass_df_table_graph1.sort_values(by=[2021], inplace=True, ascending=True)
# Generate the graph:
import plotly.graph_objects as go
country = compass_df_table_graph1['title']
Ratio17 = compass_df_table_graph1[2017]
Ratio19 = compass_df_table_graph1[2019]
Ratio21 = compass_df_table_graph1[2021]
OECD = compass_df_table_graph1["OECD"]
colors = {'Null': '#ffffff','A':'#594d45','B':'#79533d','C':'#595142','D':'#745d46','E':'#ac7e54','F':'#a9947b','G':'#b49d80','H':'#bcab90','I':'#535640','J':'#54504b','K':'#63613e','L':'#4a5444','M':'#5c5c5b','N':'#9495a5','O':'#373538','Z':'#000000'}
fig = go.Figure()
fig.add_trace(go.Scatter(
x=Ratio17,
y=country,
name='2017 Survey',
marker_color=colors['G'],
marker=dict(
size=13,
cmax=1,
cmin=0,
line_color="#000000",
)
))
#fig.add_trace(go.Scatter(
# x=Ratio19,
# y=country,
# name='2019 Survey',
# marker_color=colors['D'],
# marker=dict(
# size=13,
# cmax=1,
# cmin=0,
# line_color="#000000",
# )
#))
fig.add_trace(go.Scatter(
x=Ratio21,
y=country,
name='2021 Survey',
marker_color=colors['A'],
marker=dict(
size=13,
cmax=1,
cmin=0,
line_color="#000000",
)
))
fig.update_traces(mode='markers', marker=dict(line_width=1, symbol='circle', size=13))
fig.update_layout(
title={
'text': "Figure 14: Prevalence of topics in Australia's declared objectives vs OECD Median",
'y':0.9,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'},
title_font_color="#000000",
title_font_family="verdana",
xaxis=dict(
showgrid=False,
showline=True,
linecolor="#000000",
tickfont_color="#000000",
showticklabels=True,
dtick=0.5,
tickformat=".0%",
ticks='outside',
tickcolor="#000000",
),
margin=dict(l=140, r=40, b=50, t=100),
legend=dict(
font_size=10,
y=0.5,
#yanchor='middle',
xanchor='left',
font=dict(color="black"),
),
width=1000,
height=400,
paper_bgcolor='white',
plot_bgcolor='white',
hovermode='closest',
)
fig.update_yaxes(
tickfont=dict(family="verdana", color="#000000", size=13)
)
fig.update_xaxes(
title_text='Topic prevalence in country initiatives',
title_font_family="verdana",
title_font_size=17,
title_font_color="#000000",
fixedrange=True,
range=(-0.1, 3.5),
ticktext=["","Topic less<br>prevalent", "OECD<br>Median", "Topic more<br>prevalent"],
tickvals=[0, 0.5, 1, 1.5],
showgrid=True,
gridcolor="#444",
# gridfont=dict(family="Open Sans"),
ticks="outside",
tickson="boundaries",
tickcolor="#444",
tickfont=dict(family="verdana", color="#000000", size=13)
)
#print the graph
#fig.write_html(r"C:\Users\Gildas\OneDrive\OECD\Presentation_graphs\fig12.html")
fig.write_image(r"C:\Users\Gildas\OneDrive\OECD\Presentation_graphs\fig12.png")
fig.show()
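The same data-preparation steps are repeated verbatim for each country below. As a hedged sketch (using an illustrative schema, not the exact layout of `compass_df_table`), the normalisation could be factored into a single helper:

```python
import pandas as pd

def country_topic_ratios(df, country):
    """Normalise one country's topic frequencies by the OECD median.

    Assumes one row per topic, with a frequency column named after the
    country and an 'OECD_median' column (illustrative schema, not the
    notebook's exact one).
    """
    out = df.copy()
    out["ratio"] = out[country] / out["OECD_median"]
    # same trick as in the notebook: nudge exact zeros to a tiny value
    # so the marker still renders at the 0% gridline
    out.loc[out["ratio"] == 0, "ratio"] = 0.00275
    return out.sort_values("ratio").reset_index(drop=True)

# Illustrative input
topics = pd.DataFrame({
    "title": ["Access to finance", "Tax incentives", "Societal challenges"],
    "Australia": [0.10, 0.00, 0.30],
    "OECD_median": [0.20, 0.20, 0.20],
})
ratios = country_topic_ratios(topics, "Australia")
```

Each country section could then call this helper and pass the result to a shared plotting function, avoiding the copy-pasted blocks.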
43% of the text data reported in 2021 Brazilian policies can be associated with the seven topics identified by the model, down from 47% in 2017. Brazil, a non-OECD country, declared more initiatives on innovation entrepreneurship than the OECD median, but scored below the OECD median on five of the seven topics. It may be that Brazilian policies tackle a distinct set of topics beyond those identified by the model. The exceptions are leveraging the RDI ecosystem and stimulating demand for innovation, which are marginally more prevalent than the OECD median. Nevertheless, Brazil improved its relative attention to six of the seven topics, which may be explained by the growth in the number of reported initiatives. Notably, relative to 2017, the policies reported in 2021 place further emphasis on stimulating demand for innovation and tackling societal challenges.
### select data for selected country
compass_df_table_graph1 = compass_df_table.loc[(compass_df_table['CountryLabel'] == 'Brazil') | (compass_df_table['CountryLabel'] == 'OECD')]
compass_df_table_graph1 = compass_df_table_graph1.T.reset_index()
new_header = compass_df_table_graph1.iloc[0]
compass_df_table_graph1 = compass_df_table_graph1[2:]
compass_df_table_graph1.columns = new_header
compass_df_table_graph1 = compass_df_table_graph1.astype(float, errors='ignore')
for x in range(1, 4):
    compass_df_table_graph1.iloc[:,x] = compass_df_table_graph1.iloc[:,x] / compass_df_table_graph1.iloc[:,x+3]
compass_df_table_graph1 = compass_df_table_graph1.iloc[: , :-3]
compass_df_table_graph1["OECD"] = 1
compass_df_table_graph1 = compass_df_table_graph1.reset_index()
compass_df_table_graph1.sort_values(by=['index'], inplace=True, ascending=False)
compass_df_table_graph1 = compass_df_table_graph1.drop(['index'], axis=1)
# technical change: replace exact zeros with a small non-zero value so the marker still renders at 0%
# (use a single .iloc[row, col] assignment rather than chained indexing, which may silently fail)
for y in range(0, 6):
    for x in range(0, 5):
        if compass_df_table_graph1.iloc[y, x] == 0:
            compass_df_table_graph1.iloc[y, x] = 0.00275
# add title to dataframe
compass_df_table_graph1 = compass_df_table_graph1.sort_index()
compass_df_table_graph1 = pd.merge(compass_df_table_graph1,
topic_title,
on =['SurveyYear'],
how ='inner')
# order data by 2021 score
compass_df_table_graph1.sort_values(by=[2021], inplace=True, ascending=True)
# Generate the graph:
import plotly.graph_objects as go
country = compass_df_table_graph1['title']
Ratio17 = compass_df_table_graph1[2017]
Ratio19 = compass_df_table_graph1[2019]
Ratio21 = compass_df_table_graph1[2021]
OECD = compass_df_table_graph1["OECD"]
colors = {'Null': '#ffffff','A':'#594d45','B':'#79533d','C':'#595142','D':'#745d46','E':'#ac7e54','F':'#a9947b','G':'#b49d80','H':'#bcab90','I':'#535640','J':'#54504b','K':'#63613e','L':'#4a5444','M':'#5c5c5b','N':'#9495a5','O':'#373538','Z':'#000000'}
fig = go.Figure()
fig.add_trace(go.Scatter(
x=Ratio17,
y=country,
name='2017 Survey',
marker_color=colors['G'],
marker=dict(
size=13,
cmax=1,
cmin=0,
line_color="#000000",
)
))
fig.add_trace(go.Scatter(
x=Ratio21,
y=country,
name='2021 Survey',
marker_color=colors['A'],
marker=dict(
size=13,
cmax=1,
cmin=0,
line_color="#000000",
)
))
fig.update_traces(mode='markers', marker=dict(line_width=1, symbol='circle', size=13))
fig.update_layout(
title={
'text': "Figure 15: Prevalence of topics in Brazil's declared objectives vs OECD Median",
'y':0.9,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'},
title_font_color="#000000",
title_font_family="verdana",
xaxis=dict(
showgrid=False,
showline=True,
linecolor="#000000",
tickfont_color="#000000",
showticklabels=True,
dtick=0.5,
tickformat=".0%",
ticks='outside',
tickcolor="#000000",
),
margin=dict(l=140, r=40, b=50, t=100),
legend=dict(
font_size=10,
y=0.5,
#yanchor='middle',
xanchor='left',
font=dict(color="black"),
),
width=1000,
height=400,
paper_bgcolor='white',
plot_bgcolor='white',
hovermode='closest',
)
fig.update_yaxes(
tickfont=dict(family="verdana", color="#000000", size=13)
)
fig.update_xaxes(
title_text='Topic prevalence in country initiatives',
title_font_family="verdana",
title_font_size=17,
title_font_color="#000000",
fixedrange=True,
range=(-0.01, 2),
ticktext=["","Topic less<br>prevalent", "OECD<br>Median", "Topic more<br>prevalent"],
tickvals=[0.0, 0.5, 1, 1.5],
showgrid=True,
gridcolor="#444",
# gridfont=dict(family="Open Sans"),
ticks="outside",
tickson="boundaries",
tickcolor="#444",
tickfont=dict(family="verdana", color="#000000", size=13)
)
#print the graph
#fig.write_html(r"C:\Users\Gildas\OneDrive\OECD\Presentation_graphs\fig13.html")
fig.write_image(r"C:\Users\Gildas\OneDrive\OECD\Presentation_graphs\fig13.png")
fig.show()
56% of the text data reported in 2021 Canadian policies can be associated with the seven topics identified by the model, down from 61% in 2017. Overall, Canada has a rather different profile from the OECD median. Canadian policies stand out for their emphasis on stimulating demand for innovation and tackling societal challenges (both further emphasised in 2021). The topics of tax incentives, access to finance and leveraging the RDI ecosystem come up less frequently than the OECD median, a feature reinforced in the 2021 survey.
### select data for selected country
compass_df_table_graph1 = compass_df_table.loc[(compass_df_table['CountryLabel'] == 'Canada') | (compass_df_table['CountryLabel'] == 'OECD')]
compass_df_table_graph1 = compass_df_table_graph1.T.reset_index()
new_header = compass_df_table_graph1.iloc[0]
compass_df_table_graph1 = compass_df_table_graph1[2:]
compass_df_table_graph1.columns = new_header
compass_df_table_graph1 = compass_df_table_graph1.astype(float, errors='ignore')
for x in range(1, 4):
    compass_df_table_graph1.iloc[:,x] = compass_df_table_graph1.iloc[:,x] / compass_df_table_graph1.iloc[:,x+3]
compass_df_table_graph1 = compass_df_table_graph1.iloc[: , :-3]
compass_df_table_graph1["OECD"] = 1
compass_df_table_graph1 = compass_df_table_graph1.reset_index()
compass_df_table_graph1.sort_values(by=['index'], inplace=True, ascending=False)
compass_df_table_graph1 = compass_df_table_graph1.drop(['index'], axis=1)
# technical change: replace exact zeros with a small non-zero value so the marker still renders at 0%
# (use a single .iloc[row, col] assignment rather than chained indexing, which may silently fail)
for y in range(0, 6):
    for x in range(0, 5):
        if compass_df_table_graph1.iloc[y, x] == 0:
            compass_df_table_graph1.iloc[y, x] = 0.00275
# add title to dataframe
compass_df_table_graph1 = compass_df_table_graph1.sort_index()
compass_df_table_graph1 = pd.merge(compass_df_table_graph1,
topic_title,
on =['SurveyYear'],
how ='inner')
# order data by 2021 score
compass_df_table_graph1.sort_values(by=[2021], inplace=True, ascending=True)
# Generate the graph:
import plotly.graph_objects as go
country = compass_df_table_graph1['title']
Ratio17 = compass_df_table_graph1[2017]
Ratio19 = compass_df_table_graph1[2019]
Ratio21 = compass_df_table_graph1[2021]
OECD = compass_df_table_graph1["OECD"]
colors = {'Null': '#ffffff','A':'#594d45','B':'#79533d','C':'#595142','D':'#745d46','E':'#ac7e54','F':'#a9947b','G':'#b49d80','H':'#bcab90','I':'#535640','J':'#54504b','K':'#63613e','L':'#4a5444','M':'#5c5c5b','N':'#9495a5','O':'#373538','Z':'#000000'}
fig = go.Figure()
fig.add_trace(go.Scatter(
x=Ratio17,
y=country,
name='2017 Survey',
marker_color=colors['G'],
marker=dict(
size=13,
cmax=1,
cmin=0,
line_color="#000000",
)
))
#fig.add_trace(go.Scatter(
# x=Ratio19,
# y=country,
# name='2019 Survey',
# marker_color=colors['D'],
# marker=dict(
# size=13,
# cmax=1,
# cmin=0,
# line_color="#000000",
# )
#))
fig.add_trace(go.Scatter(
x=Ratio21,
y=country,
name='2021 Survey',
marker_color=colors['A'],
marker=dict(
size=13,
cmax=1,
cmin=0,
line_color="#000000",
)
))
fig.update_traces(mode='markers', marker=dict(line_width=1, symbol='circle', size=13))
fig.update_layout(
title={
'text': "Figure 16: Prevalence of topics in Canada's declared objectives vs OECD Median",
'y':0.9,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'},
title_font_color="#000000",
title_font_family="verdana",
xaxis=dict(
showgrid=False,
showline=True,
linecolor="#000000",
tickfont_color="#000000",
showticklabels=True,
dtick=0.5,
tickformat=".0%",
ticks='outside',
tickcolor="#000000",
),
margin=dict(l=140, r=40, b=50, t=100),
legend=dict(
font_size=10,
y=0.5,
#yanchor='middle',
xanchor='left',
font=dict(color="black"),
),
width=1000,
height=400,
paper_bgcolor='white',
plot_bgcolor='white',
hovermode='closest',
)
fig.update_yaxes(
tickfont=dict(family="verdana", color="#000000", size=13)
)
fig.update_xaxes(
title_text='Topic prevalence in country initiatives',
title_font_family="verdana",
title_font_size=17,
title_font_color="#000000",
fixedrange=True,
range=(-0.01, 3.5),
ticktext=["","Topic less<br>prevalent", "OECD<br>Median", "Topic more<br>prevalent"],
tickvals=[0, 0.5, 1, 1.5],
showgrid=True,
gridcolor="#444",
# gridfont=dict(family="Open Sans"),
ticks="outside",
tickson="boundaries",
tickcolor="#444",
tickfont=dict(family="verdana", color="#000000", size=13)
)
#print the graph
#fig.write_html(r"C:\Users\Gildas\OneDrive\OECD\Presentation_graphs\fig14.html")
fig.write_image(r"C:\Users\Gildas\OneDrive\OECD\Presentation_graphs\fig14.png")
fig.show()
51% of the text data reported in 2021 French policies can be associated with the seven topics identified by the model, down from 57% in 2017. Compared to the other countries in this analysis, France’s business innovation policy mix diverges more from the OECD median. By far the most prevalent topic in France is tax incentives, followed by stimulating demand for innovation and tackling societal challenges; the least prevalent is industrial technology development. Relative to 2017, French policies reported in the 2021 survey place less emphasis on access to finance.
### select data for selected country
compass_df_table_graph1 = compass_df_table.loc[(compass_df_table['CountryLabel'] == 'France') | (compass_df_table['CountryLabel'] == 'OECD')]
compass_df_table_graph1 = compass_df_table_graph1.T.reset_index()
new_header = compass_df_table_graph1.iloc[0]
compass_df_table_graph1 = compass_df_table_graph1[2:]
compass_df_table_graph1.columns = new_header
compass_df_table_graph1 = compass_df_table_graph1.astype(float, errors='ignore')
for x in range(1, 4):
    compass_df_table_graph1.iloc[:,x] = compass_df_table_graph1.iloc[:,x] / compass_df_table_graph1.iloc[:,x+3]
compass_df_table_graph1 = compass_df_table_graph1.iloc[: , :-3]
compass_df_table_graph1["OECD"] = 1
compass_df_table_graph1 = compass_df_table_graph1.reset_index()
compass_df_table_graph1.sort_values(by=['index'], inplace=True, ascending=False)
compass_df_table_graph1 = compass_df_table_graph1.drop(['index'], axis=1)
# technical change: replace exact zeros with a small non-zero value so the marker still renders at 0%
# (use a single .iloc[row, col] assignment rather than chained indexing, which may silently fail)
for y in range(0, 6):
    for x in range(0, 5):
        if compass_df_table_graph1.iloc[y, x] == 0:
            compass_df_table_graph1.iloc[y, x] = 0.00275
# add title to dataframe
compass_df_table_graph1 = compass_df_table_graph1.sort_index()
compass_df_table_graph1 = pd.merge(compass_df_table_graph1,
topic_title,
on =['SurveyYear'],
how ='inner')
# order data by 2021 score
compass_df_table_graph1.sort_values(by=[2021], inplace=True, ascending=True)
# Generate the graph:
import plotly.graph_objects as go
country = compass_df_table_graph1['title']
Ratio17 = compass_df_table_graph1[2017]
Ratio19 = compass_df_table_graph1[2019]
Ratio21 = compass_df_table_graph1[2021]
OECD = compass_df_table_graph1["OECD"]
colors = {'Null': '#ffffff','A':'#594d45','B':'#79533d','C':'#595142','D':'#745d46','E':'#ac7e54','F':'#a9947b','G':'#b49d80','H':'#bcab90','I':'#535640','J':'#54504b','K':'#63613e','L':'#4a5444','M':'#5c5c5b','N':'#9495a5','O':'#373538','Z':'#000000'}
fig = go.Figure()
fig.add_trace(go.Scatter(
x=Ratio17,
y=country,
name='2017 Survey',
marker_color=colors['G'],
marker=dict(
size=13,
cmax=1,
cmin=0,
line_color="#000000",
)
))
#fig.add_trace(go.Scatter(
# x=Ratio19,
# y=country,
# name='2019 Survey',
# marker_color=colors['D'],
# marker=dict(
# size=13,
# cmax=1,
# cmin=0,
# line_color="#000000",
# )
#))
fig.add_trace(go.Scatter(
x=Ratio21,
y=country,
name='2021 Survey',
marker_color=colors['A'],
marker=dict(
size=13,
cmax=1,
cmin=0,
line_color="#000000",
)
))
fig.update_traces(mode='markers', marker=dict(line_width=1, symbol='circle', size=13))
fig.update_layout(
title={
'text': "Figure 17: Prevalence of topics in France's declared objectives vs OECD Median",
'y':0.9,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'},
title_font_color="#000000",
title_font_family="verdana",
xaxis=dict(
showgrid=False,
showline=True,
linecolor="#000000",
tickfont_color="#000000",
showticklabels=True,
dtick=0.5,
tickformat=".0%",
ticks='outside',
tickcolor="#000000",
),
margin=dict(l=140, r=40, b=50, t=100),
legend=dict(
font_size=10,
y=0.5,
#yanchor='middle',
xanchor='left',
font=dict(color="black"),
),
width=1000,
height=400,
paper_bgcolor='white',
plot_bgcolor='white',
hovermode='closest',
)
fig.update_yaxes(
tickfont=dict(family="verdana", color="#000000", size=13)
)
fig.update_xaxes(
title_text='Topic prevalence in country initiatives',
title_font_family="verdana",
title_font_size=17,
title_font_color="#000000",
fixedrange=True,
range=(-0.01, 2.5),
ticktext=["","Topic less<br>prevalent", "OECD<br>Median", "Topic more<br>prevalent"],
tickvals=[0,0.5, 1, 1.5],
showgrid=True,
gridcolor="#444",
# gridfont=dict(family="Open Sans"),
ticks="outside",
tickson="boundaries",
tickcolor="#444",
tickfont=dict(family="verdana", color="#000000", size=13)
)
#print the graph
#fig.write_html(r"C:\Users\Gildas\OneDrive\OECD\Presentation_graphs\fig15.html")
fig.write_image(r"C:\Users\Gildas\OneDrive\OECD\Presentation_graphs\fig15.png")
fig.show()
47% of the text data reported in 2021 German policies can be associated with the seven topics identified by the model, down from 58% in 2017. As with France, Germany displays a distinct policy mix relative to the OECD median. Its policies place emphasis on tackling societal challenges and leveraging the RDI ecosystem; this emphasis increased in the 2021 edition of the STIP survey, particularly for the former. By contrast, business innovation policies reported by Germany mention the topics of tax incentives, entrepreneurship and business support services, and access to finance relatively less frequently.
### select data for selected country
compass_df_table_graph1 = compass_df_table.loc[(compass_df_table['CountryLabel'] == 'Germany') | (compass_df_table['CountryLabel'] == 'OECD')]
compass_df_table_graph1 = compass_df_table_graph1.T.reset_index()
new_header = compass_df_table_graph1.iloc[0]
compass_df_table_graph1 = compass_df_table_graph1[2:]
compass_df_table_graph1.columns = new_header
compass_df_table_graph1 = compass_df_table_graph1.astype(float, errors='ignore')
for x in range(1, 4):
    compass_df_table_graph1.iloc[:,x] = compass_df_table_graph1.iloc[:,x] / compass_df_table_graph1.iloc[:,x+3]
compass_df_table_graph1 = compass_df_table_graph1.iloc[: , :-3]
compass_df_table_graph1["OECD"] = 1
compass_df_table_graph1 = compass_df_table_graph1.reset_index()
compass_df_table_graph1.sort_values(by=['index'], inplace=True, ascending=False)
compass_df_table_graph1 = compass_df_table_graph1.drop(['index'], axis=1)
# technical change: replace exact zeros with a small non-zero value so the marker still renders at 0%
# (use a single .iloc[row, col] assignment rather than chained indexing, which may silently fail)
for y in range(0, 6):
    for x in range(0, 5):
        if compass_df_table_graph1.iloc[y, x] == 0:
            compass_df_table_graph1.iloc[y, x] = 0.00275
# add title to dataframe
compass_df_table_graph1 = compass_df_table_graph1.sort_index()
compass_df_table_graph1 = pd.merge(compass_df_table_graph1,
topic_title,
on =['SurveyYear'],
how ='inner')
# order data by 2021 score
compass_df_table_graph1.sort_values(by=[2021], inplace=True, ascending=True)
#compass_df_table_graph1 = compass_df_table_graph1.sort_index(ascending=False) # alternative by index (country name)
# Generate the graph:
import plotly.graph_objects as go
country = compass_df_table_graph1['title']
Ratio17 = compass_df_table_graph1[2017]
Ratio19 = compass_df_table_graph1[2019]
Ratio21 = compass_df_table_graph1[2021]
OECD = compass_df_table_graph1["OECD"]
colors = {'Null': '#ffffff','A':'#594d45','B':'#79533d','C':'#595142','D':'#745d46','E':'#ac7e54','F':'#a9947b','G':'#b49d80','H':'#bcab90','I':'#535640','J':'#54504b','K':'#63613e','L':'#4a5444','M':'#5c5c5b','N':'#9495a5','O':'#373538','Z':'#000000'}
#colors = {'A':'#1E1F26','B':'#283655','C':'#1E656D','D':'#4D648D','E':'#D0E1F9'}
fig = go.Figure()
fig.add_trace(go.Scatter(
x=Ratio17,
y=country,
name='2017 Survey',
marker_color=colors['G'],
marker=dict(
size=13,
cmax=1,
cmin=0,
line_color="#000000",
)
))
#fig.add_trace(go.Scatter(
# x=Ratio19,
# y=country,
# name='2019 Survey',
# marker_color=colors['D'],
# marker=dict(
# size=13,
# cmax=1,
# cmin=0,
# line_color="#000000",
# )
#))
fig.add_trace(go.Scatter(
x=Ratio21,
y=country,
name='2021 Survey',
marker_color=colors['A'],
marker=dict(
size=13,
cmax=1,
cmin=0,
line_color="#000000",
)
))
fig.update_traces(mode='markers', marker=dict(line_width=1, symbol='circle', size=13))
fig.update_layout(
title={
'text': "Figure 18: Prevalence of topics in Germany's declared objectives vs OECD Median",
'y':0.9,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'},
title_font_color="#000000",
title_font_family="verdana",
xaxis=dict(
showgrid=False,
showline=True,
linecolor="#000000",
tickfont_color="#000000",
showticklabels=True,
dtick=0.5,
tickformat=".0%",
ticks='outside',
tickcolor="#000000",
),
margin=dict(l=140, r=40, b=50, t=100),
legend=dict(
font_size=10,
y=0.5,
#yanchor='middle',
xanchor='left',
font=dict(color="black"),
),
width=1000,
height=400,
paper_bgcolor='white',
plot_bgcolor='white',
hovermode='closest',
)
fig.update_yaxes(
tickfont=dict(family="verdana", color="#000000", size=13)
)
fig.update_xaxes(
title_text='Topic prevalence in country initiatives',
title_font_family="verdana",
title_font_size=17,
title_font_color="#000000",
fixedrange=True,
range=(-0.01, 2),
ticktext=["","Topic less<br>prevalent", "OECD<br>Median", "Topic more<br>prevalent"],
tickvals=[0,0.5, 1, 1.5],
showgrid=True,
gridcolor="#444",
# gridfont=dict(family="Open Sans"),
ticks="outside",
tickson="boundaries",
tickcolor="#444",
tickfont=dict(family="verdana", color="#000000", size=13)
)
#print the graph
#fig.write_html(r"C:\Users\Gildas\OneDrive\OECD\Presentation_graphs\fig16.html")
fig.write_image(r"C:\Users\Gildas\OneDrive\OECD\Presentation_graphs\fig16.png")
fig.show()
49% of the text data reported in 2021 Italian policies can be associated with the seven topics identified by the model, down from 56% in 2017. Italy also has quite an atypical profile: all topic frequencies diverge from the OECD median. As was the case for Brazil, five of the seven topics are relatively less prevalent. Italy’s policy initiatives most frequently emphasise tax incentives and access to finance. In 2021, Italy increased mentions of societal challenges in its business innovation policies (relative to other topics); the opposite trend is observed for industrial technology development. Other topic frequencies remained stable between the 2017 and 2021 editions of the survey.
### select data for selected country
compass_df_table_graph1 = compass_df_table.loc[(compass_df_table['CountryLabel'] == 'Italy') | (compass_df_table['CountryLabel'] == 'OECD')]
compass_df_table_graph1 = compass_df_table_graph1.T.reset_index()
new_header = compass_df_table_graph1.iloc[0]
compass_df_table_graph1 = compass_df_table_graph1[2:]
compass_df_table_graph1.columns = new_header
compass_df_table_graph1 = compass_df_table_graph1.astype(float, errors='ignore')
for x in range(1, 4):
    compass_df_table_graph1.iloc[:,x] = compass_df_table_graph1.iloc[:,x] / compass_df_table_graph1.iloc[:,x+3]
compass_df_table_graph1 = compass_df_table_graph1.iloc[: , :-3]
compass_df_table_graph1["OECD"] = 1
compass_df_table_graph1 = compass_df_table_graph1.reset_index()
compass_df_table_graph1.sort_values(by=['index'], inplace=True, ascending=False)
compass_df_table_graph1 = compass_df_table_graph1.drop(['index'], axis=1)
# technical change: replace exact zeros with a small non-zero value so the marker still renders at 0%
# (use a single .iloc[row, col] assignment rather than chained indexing, which may silently fail)
for y in range(0, 6):
    for x in range(0, 5):
        if compass_df_table_graph1.iloc[y, x] == 0:
            compass_df_table_graph1.iloc[y, x] = 0.00275
# add title to dataframe
compass_df_table_graph1 = compass_df_table_graph1.sort_index()
compass_df_table_graph1 = pd.merge(compass_df_table_graph1,
topic_title,
on =['SurveyYear'],
how ='inner')
# order data by 2021 score
compass_df_table_graph1.sort_values(by=[2021], inplace=True, ascending=True)
# Generate the graph:
import plotly.graph_objects as go
country = compass_df_table_graph1['title']
Ratio17 = compass_df_table_graph1[2017]
Ratio19 = compass_df_table_graph1[2019]
Ratio21 = compass_df_table_graph1[2021]
OECD = compass_df_table_graph1["OECD"]
colors = {'Null': '#ffffff','A':'#594d45','B':'#79533d','C':'#595142','D':'#745d46','E':'#ac7e54','F':'#a9947b','G':'#b49d80','H':'#bcab90','I':'#535640','J':'#54504b','K':'#63613e','L':'#4a5444','M':'#5c5c5b','N':'#9495a5','O':'#373538','Z':'#000000'}
fig = go.Figure()
fig.add_trace(go.Scatter(
x=Ratio17,
y=country,
name='2017 Survey',
marker_color=colors['G'],
marker=dict(
size=13,
cmax=1,
cmin=0,
line_color="#000000",
)
))
#fig.add_trace(go.Scatter(
# x=Ratio19,
# y=country,
# name='2019 Survey',
# marker_color=colors['D'],
# marker=dict(
# size=13,
# cmax=1,
# cmin=0,
# line_color="#000000",
# )
#))
fig.add_trace(go.Scatter(
x=Ratio21,
y=country,
name='2021 Survey',
marker_color=colors['A'],
marker=dict(
size=13,
cmax=1,
cmin=0,
line_color="#000000",
)
))
fig.update_traces(mode='markers', marker=dict(line_width=1, symbol='circle', size=13))
fig.update_layout(
title={
'text': "Figure 19: Prevalence of topics in Italy's declared objectives vs OECD Median",
'y':0.9,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'},
title_font_color="#000000",
title_font_family="verdana",
xaxis=dict(
showgrid=False,
showline=True,
linecolor="#000000",
tickfont_color="#000000",
showticklabels=True,
dtick=0.5,
tickformat=".0%",
ticks='outside',
tickcolor="#000000",
),
margin=dict(l=140, r=40, b=50, t=100),
legend=dict(
font_size=10,
y=0.5,
#yanchor='middle',
xanchor='left',
font=dict(color="black"),
),
width=1000,
height=400,
paper_bgcolor='white',
plot_bgcolor='white',
hovermode='closest',
)
fig.update_yaxes(
tickfont=dict(family="verdana", color="#000000", size=13)
)
fig.update_xaxes(
title_text='Topic prevalence in country initiatives',
title_font_family="verdana",
title_font_size=17,
title_font_color="#000000",
fixedrange=True,
range=(-0.01, 3),
ticktext=["","Topic less<br>prevalent", "OECD<br>Median", "Topic more<br>prevalent"],
tickvals=[0, 0.5, 1, 1.5],
showgrid=True,
gridcolor="#444",
# gridfont=dict(family="Open Sans"),
ticks="outside",
tickson="boundaries",
tickcolor="#444",
tickfont=dict(family="verdana", color="#000000", size=13)
)
#print the graph
#fig.write_html(r"C:\Users\Gildas\OneDrive\OECD\Presentation_graphs\fig17.html")
fig.write_image(r"C:\Users\Gildas\OneDrive\OECD\Presentation_graphs\fig17.png")
fig.show()
48% of the text data reported in 2021 Korean policies can be associated with the seven topics identified by the model, slightly down from 49% in 2017. In Korea, most topic frequencies changed significantly between 2017 and 2021, which may be explained by the growth in the number of reported initiatives. In 2021, all topics are more prevalent than the OECD median, with industrial technology development the topic most emphasised by Korean policies. Mentions of societal challenges, in particular, have increased well beyond the OECD median. Recent policies also place greater emphasis on stimulating demand for innovation.
### select data for selected country
compass_df_table_graph1 = compass_df_table.loc[(compass_df_table['CountryLabel'] == 'Korea') | (compass_df_table['CountryLabel'] == 'OECD')]
compass_df_table_graph1 = compass_df_table_graph1.T.reset_index()
new_header = compass_df_table_graph1.iloc[0]
compass_df_table_graph1 = compass_df_table_graph1[2:]
compass_df_table_graph1.columns = new_header
compass_df_table_graph1 = compass_df_table_graph1.astype(float, errors='ignore')
# normalise each survey-year column by the corresponding OECD-median column
for x in range(1, 4):
    compass_df_table_graph1.iloc[:, x] = compass_df_table_graph1.iloc[:, x] / compass_df_table_graph1.iloc[:, x + 3]
compass_df_table_graph1 = compass_df_table_graph1.iloc[: , :-3]
compass_df_table_graph1["OECD"] = 1
compass_df_table_graph1 = compass_df_table_graph1.reset_index()
compass_df_table_graph1.sort_values(by=['index'], inplace=True, ascending=False)
compass_df_table_graph1 = compass_df_table_graph1.drop(['index'], axis=1)
# technical workaround: replace exact zeros with a tiny value so the marker
# still renders on the chart while its ".0%" tick label displays as 0%
for y in range(0, 6):
    for x in range(0, 5):
        if compass_df_table_graph1.iloc[y, x] == 0:
            compass_df_table_graph1.iloc[y, x] = 0.00275
# add title to dataframe
compass_df_table_graph1 = compass_df_table_graph1.sort_index()
compass_df_table_graph1 = pd.merge(compass_df_table_graph1,
topic_title,
on =['SurveyYear'],
how ='inner')
# order data by 2021 score
compass_df_table_graph1.sort_values(by=[2021], inplace=True, ascending=True)
# Generate the graph:
import plotly.graph_objects as go
country = compass_df_table_graph1['title']
Ratio17 = compass_df_table_graph1[2017]
Ratio19 = compass_df_table_graph1[2019]
Ratio21 = compass_df_table_graph1[2021]
OECD = compass_df_table_graph1["OECD"]
colors = {'Null': '#ffffff','A':'#594d45','B':'#79533d','C':'#595142','D':'#745d46','E':'#ac7e54','F':'#a9947b','G':'#b49d80','H':'#bcab90','I':'#535640','J':'#54504b','K':'#63613e','L':'#4a5444','M':'#5c5c5b','N':'#9495a5','O':'#373538','Z':'#000000'}
fig = go.Figure()
fig.add_trace(go.Scatter(
x=Ratio17,
y=country,
name='2017 Survey',
marker_color=colors['G'],
marker=dict(
size=13,
cmax=1,
cmin=0,
line_color="#000000",
)
))
#fig.add_trace(go.Scatter(
# x=Ratio19,
# y=country,
# name='2019 Survey',
# marker_color=colors['D'],
# marker=dict(
# size=13,
# cmax=1,
# cmin=0,
# line_color="#000000",
# )
#))
fig.add_trace(go.Scatter(
x=Ratio21,
y=country,
name='2021 Survey',
marker_color=colors['A'],
marker=dict(
size=13,
cmax=1,
cmin=0,
line_color="#000000",
)
))
fig.update_traces(mode='markers', marker=dict(line_width=1, symbol='circle', size=13))
fig.update_layout(
title={
'text': "Figure 20: Prevalence of topics in Korea's declared objectives vs OECD Median",
'y':0.9,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'},
title_font_color="#000000",
title_font_family="verdana",
xaxis=dict(
showgrid=False,
showline=True,
linecolor="#000000",
tickfont_color="#000000",
showticklabels=True,
dtick=0.5,
tickformat=".0%",
ticks='outside',
tickcolor="#000000",
),
margin=dict(l=140, r=40, b=50, t=100),
legend=dict(
font_size=10,
y=0.5,
#yanchor='middle',
xanchor='left',
font=dict(color="black"),
),
width=1000,
height=400,
paper_bgcolor='white',
plot_bgcolor='white',
hovermode='closest',
)
fig.update_yaxes(
tickfont=dict(family="verdana", color="#000000", size=13)
)
fig.update_xaxes(
title_text='Topic prevalence in country initiatives',
title_font_family="verdana",
title_font_size=17,
title_font_color="#000000",
fixedrange=True,
range=(-0.01, 3.5),
ticktext=["","Topic less<br>prevalent", "OECD<br>Median", "Topic more<br>prevalent"],
tickvals=[0, 0.5, 1, 1.5],
showgrid=True,
gridcolor="#444",
# gridfont=dict(family="Open Sans"),
ticks="outside",
tickson="boundaries",
tickcolor="#444",
tickfont=dict(family="verdana", color="#000000", size=13)
)
#print the graph
#fig.write_html(r"C:\Users\Gildas\OneDrive\OECD\Presentation_graphs\fig18.html")
fig.write_image(r"C:\Users\Gildas\OneDrive\OECD\Presentation_graphs\fig18.png")
fig.show()
64% of the text data reported in British policies in 2021 can be associated with the seven topics identified by the model, up from 62% in 2017. British policies stand out for frequently raising the topic of stimulating demand for innovation, followed by entrepreneurship and business support services. The topics of tax incentives and tackling societal challenges receive additional emphasis in the 2021 survey edition. Relative to the OECD median, British policies mention the topic of leveraging the RDI ecosystem less frequently.
### select data for selected country
compass_df_table_graph1 = compass_df_table.loc[(compass_df_table['CountryLabel'] == 'United Kingdom') | (compass_df_table['CountryLabel'] == 'OECD')]
compass_df_table_graph1 = compass_df_table_graph1.T.reset_index()
new_header = compass_df_table_graph1.iloc[0]
compass_df_table_graph1 = compass_df_table_graph1[2:]
compass_df_table_graph1.columns = new_header
compass_df_table_graph1 = compass_df_table_graph1.astype(float, errors='ignore')
# normalise each survey-year column by the corresponding OECD-median column
for x in range(1, 4):
    compass_df_table_graph1.iloc[:, x] = compass_df_table_graph1.iloc[:, x] / compass_df_table_graph1.iloc[:, x + 3]
compass_df_table_graph1 = compass_df_table_graph1.iloc[: , :-3]
compass_df_table_graph1["OECD"] = 1
compass_df_table_graph1 = compass_df_table_graph1.reset_index()
compass_df_table_graph1.sort_values(by=['index'], inplace=True, ascending=False)
compass_df_table_graph1 = compass_df_table_graph1.drop(['index'], axis=1)
# technical workaround: replace exact zeros with a tiny value so the marker
# still renders on the chart while its ".0%" tick label displays as 0%
for y in range(0, 6):
    for x in range(0, 5):
        if compass_df_table_graph1.iloc[y, x] == 0:
            compass_df_table_graph1.iloc[y, x] = 0.00275
# add title to dataframe
compass_df_table_graph1 = compass_df_table_graph1.sort_index()
compass_df_table_graph1 = pd.merge(compass_df_table_graph1,
topic_title,
on =['SurveyYear'],
how ='inner')
# order data by 2021 score
compass_df_table_graph1.sort_values(by=[2021], inplace=True, ascending=True)
# Generate the graph:
import plotly.graph_objects as go
country = compass_df_table_graph1['title']
Ratio17 = compass_df_table_graph1[2017]
Ratio19 = compass_df_table_graph1[2019]
Ratio21 = compass_df_table_graph1[2021]
OECD = compass_df_table_graph1["OECD"]
colors = {'Null': '#ffffff','A':'#594d45','B':'#79533d','C':'#595142','D':'#745d46','E':'#ac7e54','F':'#a9947b','G':'#b49d80','H':'#bcab90','I':'#535640','J':'#54504b','K':'#63613e','L':'#4a5444','M':'#5c5c5b','N':'#9495a5','O':'#373538','Z':'#000000'}
fig = go.Figure()
fig.add_trace(go.Scatter(
x=Ratio17,
y=country,
name='2017 Survey',
marker_color=colors['G'],
marker=dict(
size=13,
cmax=1,
cmin=0,
line_color="#000000",
)
))
#fig.add_trace(go.Scatter(
# x=Ratio19,
# y=country,
# name='2019 Survey',
# marker_color=colors['D'],
# marker=dict(
# size=13,
# cmax=1,
# cmin=0,
# line_color="#000000",
# )
#))
fig.add_trace(go.Scatter(
x=Ratio21,
y=country,
name='2021 Survey',
marker_color=colors['A'],
marker=dict(
size=13,
cmax=1,
cmin=0,
line_color="#000000",
)
))
fig.update_traces(mode='markers', marker=dict(line_width=1, symbol='circle', size=13))
fig.update_layout(
title={
'text': "Figure 21: Prevalence of topics in the UK's declared objectives vs OECD Median",
'y':0.9,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'},
title_font_color="#000000",
title_font_family="verdana",
xaxis=dict(
showgrid=False,
showline=True,
linecolor="#000000",
tickfont_color="#000000",
showticklabels=True,
dtick=0.5,
tickformat=".0%",
ticks='outside',
tickcolor="#000000",
),
margin=dict(l=140, r=40, b=50, t=100),
legend=dict(
font_size=10,
y=0.5,
#yanchor='middle',
xanchor='left',
font=dict(color="black"),
),
width=1000,
height=400,
paper_bgcolor='white',
plot_bgcolor='white',
hovermode='closest',
)
fig.update_yaxes(
tickfont=dict(family="verdana", color="#000000", size=13)
)
fig.update_xaxes(
title_text='Topic prevalence in country initiatives',
title_font_family="verdana",
title_font_size=17,
title_font_color="#000000",
fixedrange=True,
range=(-0.01, 3.5),
ticktext=["","Topic less<br>prevalent", "OECD<br>Median", "Topic more<br>prevalent"],
tickvals=[0, 0.5, 1, 1.5],
showgrid=True,
gridcolor="#444",
# gridfont=dict(family="Open Sans"),
ticks="outside",
tickson="boundaries",
tickcolor="#444",
tickfont=dict(family="verdana", color="#000000", size=13)
)
#print the graph
#fig.write_html(r"C:\Users\Gildas\OneDrive\OECD\Presentation_graphs\fig19.html")
fig.write_image(r"C:\Users\Gildas\OneDrive\OECD\Presentation_graphs\fig19.png")
fig.show()
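The per-country preparation and charting code above is repeated verbatim for each country. Its core transformation — dividing each topic's survey-year frequency by the OECD median, and flooring exact zeros so the marker still renders at a position the ".0%" format displays as 0% — can be sketched as a single reusable function. This is an illustrative refactoring with hypothetical topic names and counts, not the notebook's actual pandas pipeline:

```python
def prevalence_ratios(country_counts, oecd_median, zero_floor=0.00275):
    """Normalise a country's per-topic frequencies against the OECD median.

    Exact-zero ratios are replaced by a tiny floor value so the chart
    still draws a visible marker, which the ".0%" tick format shows as 0%.
    """
    ratios = {}
    for topic, count in country_counts.items():
        median = oecd_median[topic]
        ratio = count / median if median else 0.0
        ratios[topic] = ratio if ratio > 0 else zero_floor
    return ratios

# Hypothetical topic frequencies, for illustration only
korea_2021 = {"industrial tech": 12, "societal challenges": 8, "demand": 0}
oecd_median = {"industrial tech": 6, "societal challenges": 5, "demand": 4}
print(prevalence_ratios(korea_2021, oecd_median))
# {'industrial tech': 2.0, 'societal challenges': 1.6, 'demand': 0.00275}
```

Factoring the block this way would also make it harder for the per-country copies to drift apart (as the axis ranges already have between figures).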
Natural language processing provides a way of analysing the large volumes of unstructured data reported in the STIP databases: roughly 350 000 words describing policy objectives around support for business innovation. The approach systematically summarises textual data, allows cross-country comparisons and makes it possible to observe shifts in policy trends over time. The pilot used a machine learning model to identify key topics and to measure how frequently countries raise them in their policies, giving a sense of the orientation of national policy mixes without measuring policy effort. Measuring the latter would require further harmonisation of the budget data reported in the STIP survey.
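The prevalence measure rests on aggregating, for each country and survey year, the share of text the topic model assigns to each topic across all of that country's initiatives. A minimal sketch of that aggregation step, using hypothetical document-topic weights rather than the pilot's actual model output:

```python
def topic_shares(doc_topic_weights):
    """Aggregate per-document topic weights into overall topic shares."""
    totals = {}
    for weights in doc_topic_weights:
        for topic, weight in weights.items():
            totals[topic] = totals.get(topic, 0.0) + weight
    grand_total = sum(totals.values())
    return {topic: total / grand_total for topic, total in totals.items()}

# Two hypothetical policy initiatives with model-assigned topic weights
docs = [
    {"stimulating demand": 0.5, "tax incentives": 0.5},
    {"stimulating demand": 0.75, "tax incentives": 0.25},
]
print(topic_shares(docs))
# {'stimulating demand': 0.625, 'tax incentives': 0.375}
```

Comparing such shares against the OECD median share, as the figures above do, captures relative emphasis rather than spending, which is why the results describe orientation rather than policy effort.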
While the robustness checks performed in this pilot suggest that the model coherently clusters initiatives under salient topics, the results presented here require further validation. In particular, STIP survey national contact points and OECD CSTP country delegates could assess how accurately the model represents their data (and the reality of their national business innovation policy). We plan to reach out to these national experts to learn whether the data points and trends presented in this pilot align with how they perceive their business innovation policy mix.